
docs: add Grafana dashboard screenshot#8

Merged
streamer45 merged 1 commit into main from grafana on Dec 29, 2025

Conversation

@streamer45
Owner

Summary

Adds a link to the Grafana dashboard file, plus a screenshot.

@streamer45 streamer45 self-assigned this Dec 29, 2025
@streamer45 streamer45 merged commit ff1727a into main Dec 29, 2025
12 checks passed
@streamer45 streamer45 deleted the grafana branch December 29, 2025 13:42
staging-devin-ai-integration bot pushed a commit that referenced this pull request Feb 24, 2026
… recv_from_any_slot

Introduces SlotRecvResult enum with Frame/ChannelClosed/NonVideo/Empty variants.
The main loop now removes closed slots and skips non-video packets instead of
treating any single channel close as all-inputs-closed.

Also adds a comment about dropped in-flight results on shutdown (Fix #6).

Optimizes overlay cloning by using Arc<[Arc<DecodedOverlay>]> instead of
Vec<Arc<DecodedOverlay>> so cloning into the work item each frame is a single
ref-count bump instead of a full Vec clone (Fix #8).

Fixes: #1, #6, #8
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
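The SlotRecvResult handling above can be sketched roughly like this; the slot storage and the async receive path are simplified stand-ins for the real compositor types, and only the control flow from the commit message is modeled:

```rust
// Sketch only: variants follow the commit message; payloads are illustrative.
enum SlotRecvResult {
    Frame(usize),         // a video frame arrived on slot N
    ChannelClosed(usize), // slot N's upstream channel closed
    NonVideo(usize),      // slot N delivered a non-video packet
    Empty,
}

/// Returns true while the main loop should keep running. A single closed
/// channel removes only that slot; the loop stops once *all* inputs close.
fn handle(result: SlotRecvResult, slots: &mut Vec<&'static str>) -> bool {
    match result {
        SlotRecvResult::Frame(_) => true,    // composite this frame
        SlotRecvResult::NonVideo(_) => true, // skip the packet, keep the slot
        SlotRecvResult::Empty => true,
        SlotRecvResult::ChannelClosed(idx) => {
            slots.remove(idx);   // drop just this slot, not the whole loop
            !slots.is_empty()    // all-inputs-closed only when none remain
        }
    }
}
```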
staging-devin-ai-integration bot added a commit that referenced this pull request Mar 13, 2026
Implements all 13 actionable findings from the video feature review
(finding #11 skipped — would require core PixelFormat serde changes):

WebM muxer (webm.rs):
- Add shutdown/cancellation handling to the receive loop via
  tokio::select! on context.control_rx, matching the pattern used
  by the OGG muxer and colorbars node (fix #1, important)
- Remove dead chunk_size config field and DEFAULT_CHUNK_SIZE constant;
  update test that referenced it (fix #2, important)
- Make Seek on Live MuxBuffer return io::Error(Unsupported) instead of
  warn-and-clamp to fail fast on unexpected seek calls (fix #3, important)
- Add comment noting VP9 CodecPrivate constants must stay in sync with
  encoder config in video/mod.rs (fix #4, important)
- Make OpusHead pre_skip configurable via WebMMuxerConfig::opus_preskip_samples
  instead of always using the hardcoded constant (fix #6, minor)
- Group mux_frame loose parameters into MuxState struct (fix #12, nit)
- Fix BitReader::read() doc comment range 1..=16 → 1..=32 (fix #14, nit)

VP9 codec (vp9.rs):
- Add startup-time ABI assertion verifying vpx_codec_vp9_cx/dx return
  non-null VP9 interfaces (fix #5, minor)

Colorbars (colorbars.rs):
- Add draw_time_use_pts config option to stamp PTS instead of wall-clock
  time, more useful for A/V timing debugging (fix #7, minor)
- Document studio-range assumption in SMPTE bar YUV table comment with
  note explaining why white Y=180 (fix #13, nit)

OGG muxer (ogg.rs):
- Remove dead is_first_packet field and its no-op toggle (fix #10, minor)

Tests (tests.rs):
- Add File mode (WebMStreamingMode::File) test exercising the seekable
  temp-file code path (fix #8, minor)
- Add edge-case tests: non-keyframe first video packet and truncated/
  corrupt VP9 header — verify no panics (fix #9, minor)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Signed-off-by: bot_apk <apk@cognition.ai>
Co-Authored-By: Staging-Devin AI <166158716+staging-devin-ai-integration[bot]@users.noreply.github.com>
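Fix #14 above corrects the documented read range of BitReader::read() to 1..=32. A minimal MSB-first reader honoring that contract might look like this; the real BitReader in webm.rs is assumed to differ in detail:

```rust
/// Minimal MSB-first bit reader sketch (illustrative, not the in-tree type).
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // absolute bit position
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Reads `n` bits (1..=32), most-significant bit first.
    /// Returns None when fewer than `n` bits remain.
    fn read(&mut self, n: u32) -> Option<u32> {
        assert!((1..=32).contains(&n), "read supports 1..=32 bits");
        if self.pos + n as usize > self.data.len() * 8 {
            return None;
        }
        let mut out = 0u32;
        for _ in 0..n {
            let byte = self.data[self.pos / 8];
            let bit = (byte >> (7 - self.pos % 8)) & 1;
            out = (out << 1) | u32::from(bit);
            self.pos += 1;
        }
        Some(out)
    }
}
```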
streamer45 pushed a commit that referenced this pull request Mar 13, 2026
streamer45 added a commit that referenced this pull request Mar 15, 2026
* chore: update roadmap

* feat(video): update packet types, docs, and compatibility rules

* feat(video): make raw video layout explicit + enforce aligned buffers

* feat(webm): extend muxer with VP9 video track support (PR4)

- Add dual input pins: 'audio' (Opus) and 'video' (VP9), both optional
- Add video track via VideoCodecId::VP9 with configurable width/height
- Multiplex audio and video frames using tokio::select! in receive loop
- Track monotonic timestamps across tracks (clamp to last_written_ns)
- Convert timestamps from microseconds to nanoseconds for webm crate
- Dynamic content-type: video/webm;codecs="vp9,opus" | vp9 | opus
- Extract flush logic into flush_output() helper
- Add video_width/video_height to WebMMuxerConfig
- Add MuxTracks struct and webm_content_type() const helper
- Update node registration description
- Add test: VP9 video-only encode->mux produces parseable WebM
- Add test: no-inputs-connected returns error
- Update existing tests to use new 'audio' pin name

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
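The timestamp handling described above (microseconds to nanoseconds for the webm crate, clamped so cross-track timestamps never go backwards) reduces to a small helper; the function name and signature here are illustrative:

```rust
/// Convert a packet timestamp from µs to ns and clamp it to the last
/// written timestamp so the muxed stream stays monotonic across tracks.
fn mux_timestamp_ns(ts_us: u64, last_written_ns: &mut u64) -> u64 {
    let ts_ns = ts_us.saturating_mul(1_000).max(*last_written_ns);
    *last_written_ns = ts_ns;
    ts_ns
}
```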

* feat: end-to-end video pipeline support

- YAML compiler: add Needs::Map variant for named pin targeting
- Color Bars Generator: SMPTE I420 source node (video::colorbars)
- MoQ Peer: video input pin, catalog with VP9, track publishing
- Frontend: generalize MSEPlayer for audio/video, ConvertView video support
- Frontend: MoQ video playback via Hang Video.Renderer in StreamView
- Sample pipelines: oneshot (color bars -> VP9 -> WebM) and dynamic (MoQ stream)

Signed-off-by: Devin AI <devin@cognition.ai>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): video-aware ConvertView for no-input pipelines

- Detect pipelines without http_input as no-input (hides upload UI)
- Add checkIfVideoPipeline helper for video pipeline detection
- Update output mode label: 'Play Video' for video pipelines
- Derive isVideoPipeline from pipeline YAML via useMemo

Signed-off-by: Devin AI <devin@cognition.ai>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(server): allow generator-only oneshot pipelines without http_input

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(engine): allow generator-only oneshot pipelines without file_reader

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): enable video feature (vp9 + colorbars) in default features

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: generator pipeline start signals, video-only content-type, and media-generic UI messages

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix

* feat: add sweep bar animation to colorbars, skip publish for receive-only pipelines

- ColorBarsNode now draws a 4px bright-white vertical bar that sweeps
  across the frame at 4px/frame, making motion clearly visible.
- extractMoqPeerSettings returns hasInputBroadcast so the UI can infer
  whether a pipeline expects a publisher.
- handleTemplateSelect auto-sets enablePublish=false for receive-only
  pipelines (no input_broadcast), skipping microphone access.
- decideConnect respects enablePublish in session mode instead of
  always forcing shouldPublish=true.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(vp9): configurable encoder deadline (default realtime), avoid unnecessary metadata clones

- Add Vp9EncoderDeadline enum (realtime/good_quality/best_quality) to
  Vp9EncoderConfig, defaulting to Realtime instead of the previous
  hard-coded VPX_DL_BEST_QUALITY.
- Store deadline in Vp9Encoder struct and use it in encode_frame/flush.
- Encoder input task: use .take() instead of .clone() on frame metadata
  since the frame is moved into the channel anyway.
- Decoder decode_packet: peek ahead and only clone metadata when
  multiple frames are produced; move it on the last iteration.
- Encoder drain_packets: same peek-ahead pattern to avoid cloning
  metadata on the last (typically only) output packet.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: cargo fmt

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* test(e2e): add video pipeline tests for convert and MoQ stream views

- Add verifyVideoPlayback helper for MSEPlayer video element verification
- Add verifyCanvasRendering helper for canvas-based video frame verification
- Add convert view test: select video colorbars template, generate, verify video player
- Add stream view test: create MoQ video session, connect, verify canvas rendering

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: correct webm_muxer pin name in mixing pipeline and convert button text in asset mode

- mixing.yml: use 'audio' input pin for webm_muxer instead of default 'in' pin
- ConvertView: show 'Convert File' button text when in asset mode (not 'Generate')
- test-helpers: fix prettier formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(webm-muxer): generic input pins with runtime media type detection

Replace fixed 'audio'/'video' pin names with generic 'in'/'in_1' pins
that accept both EncodedAudio(Opus) and EncodedVideo(VP9). The actual
media type is detected at runtime by inspecting the first packet's
content_type field (video/* → video track, everything else → audio).

This makes the muxer future-proof for additional track types (subtitles,
data channels, etc.) without requiring pin-name changes.

Pin layout is config-driven:
- Default (no video dimensions): single 'in' pin — fully backward
  compatible with existing audio-only pipelines.
- With video_width/video_height > 0: two pins 'in' + 'in_1'.

Updated all affected sample pipelines and documentation.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: cargo fmt

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(webm-muxer): connection-time type detection via NodeContext.input_types

Replace packet probing with connection-time media type detection. The graph
builder now populates NodeContext.input_types with the upstream output's
PacketType for each connected pin, so the webm muxer can classify inputs
as audio or video without inspecting any packets.

Changes:
- Add input_types: HashMap<String, PacketType> to NodeContext
- Populate input_types in graph_builder (oneshot pipelines)
- Leave empty in dynamic_actor (connections happen after spawn)
- Refactor WebMMuxerNode::run() to use input_types instead of probing
- Remove first-packet buffering logic from receive loop
- Update all NodeContext constructions in test code
- Update docs to reflect connection-time detection

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): add video compositor node with dynamic inputs, overlays, and spawn_blocking

Implements the video::compositor node (PR3 from VIDEO_SUPPORT_PLAN.md):

- Dynamic input pins (PinCardinality::Dynamic) for attaching arbitrary
  raw video inputs at runtime
- RGBA8 output canvas with configurable dimensions (default 1280x720)
- Image overlays: decoded once at init via the `image` crate (PNG/JPEG)
- Text overlays: rasterized once per UpdateParams via `tiny-skia`
- Compositing runs in spawn_blocking to avoid blocking the async runtime
- Nearest-neighbor scaling for MVP (bilinear/GPU follow-up)
- Per-layer opacity and rect positioning
- NodeControlMessage::UpdateParams support for live parameter tuning
- Pool-based buffer allocation via VideoFramePool
- Metadata propagation (timestamp, duration, sequence) from first input

New dependencies:
- image 0.25.9 (MIT/Apache-2.0) — PNG/JPEG decoding, features: png, jpeg
- tiny-skia 0.12.0 (BSD-3-Clause) — 2D rendering, pure Rust
- base64 0.22 (MIT/Apache-2.0) — base64 decoding for image overlay data

14 tests covering compositing helpers, config validation, node integration,
metadata preservation, and pool usage.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
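The nearest-neighbor scaling named above as the MVP strategy boils down to an integer source-index mapping per destination pixel. A stripped-down sketch (the real scale_blit_rgba is assumed to also handle rects, clipping, and opacity):

```rust
/// Nearest-neighbor scale of a tightly packed RGBA8 image.
/// `src` is sw*sh pixels, `dst` is dw*dh pixels, 4 bytes per pixel.
fn scale_nearest_rgba(
    src: &[u8], sw: usize, sh: usize,
    dst: &mut [u8], dw: usize, dh: usize,
) {
    for dy in 0..dh {
        let sy = dy * sh / dh; // nearest source row
        for dx in 0..dw {
            let sx = dx * sw / dw; // nearest source column
            let s = (sy * sw + sx) * 4;
            let d = (dy * dw + dx) * 4;
            dst[d..d + 4].copy_from_slice(&src[s..s + 4]);
        }
    }
}
```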

* style: cargo fmt

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): address review findings and add sample pipeline

- Fix shutdown propagation: add should_stop flag so Shutdown in the
  non-blocking try_recv loop properly breaks the outer loop instead of
  falling through to an extra composite pass.
- Fix canvas resize: remove stale canvas_w/canvas_h locals captured once
  at init; read self.config.width/height directly so UpdateParams
  dimension changes take effect immediately.
- Fix image overlay re-decode: always re-decode image overlays on
  UpdateParams, not only when the count changes (content/rect/opacity
  changes were silently ignored).
- Add video_compositor_demo.yml oneshot sample pipeline: colorbars →
  compositor (with text overlay) → VP9 → WebM → HTTP output.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): use single needs variant in sample pipeline YAML

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): remove deeply nested params from sample YAML

serde_saphyr cannot deserialize YAML with 4+ nesting levels inside
params when the top-level type is an untagged enum (UserPipeline).
Text/image overlays with nested rect objects trigger this limitation.

Removed text_overlays from the static sample YAML. Overlays can still
be configured at runtime via UpdateParams (JSON, not serde_saphyr).

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): add num_inputs for static pin pre-creation in oneshot pipelines

Mirrors the AudioMixerNode pattern: when num_inputs is set in params,
pre-create input pins so the graph builder can wire connections at
startup. Single input uses pin name 'in' (matching YAML convention),
multiple inputs use 'in_0', 'in_1', etc.

The sample pipeline now sets num_inputs: 1 so the compositor declares
the 'in' pin that the graph builder expects.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): accept I420 inputs and configurable output format

- Colorbars node: add pixel_format config (i420 default, rgba8 supported)
  with RGBA8 generation + sweep bar functions
- Compositor: accept both I420 and RGBA8 inputs (auto-converts I420 to
  RGBA8 internally for compositing via BT.601 conversion)
- Compositor: add output_pixel_format config (rgba8 default, i420 for
  VP9 encoder compatibility) with RGBA8→I420 output conversion
- Sample pipeline: uses I420 colorbars → compositor (output_pixel_format:
  i420) → VP9 encoder → WebM muxer → HTTP output

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
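For the BT.601 conversion mentioned above, a common integer-coefficient luma formula (an assumption; the in-tree conversion may use different coefficients) also explains the studio-range note elsewhere in this PR: SMPTE 75% bars use 75% white (191,191,191), which lands at Y=180 rather than the full-range 255:

```rust
/// Full-range RGB -> studio-range (16..=235) luma, BT.601,
/// using widely used 8-bit fixed-point coefficients (assumed here).
fn bt601_luma(r: u8, g: u8, b: u8) -> u8 {
    let y = (66 * u32::from(r) + 129 * u32::from(g) + 25 * u32::from(b) + 128) >> 8;
    (y + 16) as u8 // y <= 219, so y + 16 <= 235 and the cast cannot truncate
}
```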

* fix(compositor): process every frame instead of draining to latest

The non-blocking try_recv loop was draining all queued frames and keeping
only the latest per slot. When spawn_blocking compositing was slower than
the producer (colorbars at 90 frames), intermediate frames were dropped,
resulting in only 2 output frames.

Changed to take at most one frame per slot per loop iteration so every
produced frame is composited and forwarded downstream.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): auto-PiP positioning and two-input sample pipeline

- Non-first layers without explicit layers config are auto-positioned as
  PiP windows (bottom-right corner, 1/3 canvas size, 0.9 opacity)
- Sample pipeline now uses two colorbars sources: 640x480 I420 background
  + 320x240 RGBA8 PiP overlay, making compositing visually obvious

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
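The auto-PiP rule above (bottom-right corner, 1/3 canvas size, 0.9 opacity) can be sketched as follows; the struct and the flush-to-corner placement are assumptions, since the commit does not specify margins:

```rust
/// Illustrative stand-in for the auto-positioned PiP layer geometry.
struct PipRect { x: i32, y: i32, w: u32, h: u32, opacity: f32 }

/// Place a PiP window in the bottom-right corner at 1/3 canvas size
/// (no margin assumed) with 0.9 opacity.
fn auto_pip(canvas_w: u32, canvas_h: u32) -> PipRect {
    let w = canvas_w / 3;
    let h = canvas_h / 3;
    PipRect {
        x: (canvas_w - w) as i32, // flush to the right edge
        y: (canvas_h - h) as i32, // flush to the bottom edge
        w,
        h,
        opacity: 0.9,
    }
}
```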

* perf(compositor): move all pixel format conversions into spawn_blocking

Previously I420→RGBA8 (input) and RGBA8→I420 (output) conversions ran
on the async runtime, blocking it for ~307K pixel iterations per frame
per input. Now all conversions run inside the spawn_blocking task
alongside compositing, keeping the async runtime free for channel ops.

- Removed ensure_rgba8() calls from frame receive paths
- Store raw frames (I420 or RGBA8) in InputSlot.latest_frame
- Added pixel_format field to LayerSnapshot
- composite_frame() converts I420→RGBA8 on-the-fly per layer
- RGBA8→I420 output conversion also runs inside spawn_blocking

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): parallelize with rayon and use persistent blocking thread

- Add rayon as optional dependency gated on compositor feature
- Parallelize scale_blit_rgba() across rows using rayon::par_chunks_mut
- Split blit into blit_row_opaque (no alpha multiply) and blit_row_alpha
- Parallelize i420_to_rgba8() and rgba8_to_i420() row processing
- Replace per-frame spawn_blocking with persistent blocking thread via channels
- Add CompositeWorkItem/CompositeResult types for channel communication

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(compositor): modularize into config, overlay, pixel_ops, and kernel sub-modules

Split the 1700+ line compositor.rs into focused sub-modules:
- config.rs: configuration types, validation, pixel format parsing
- overlay.rs: DecodedOverlay, image decoding, text rasterization
- pixel_ops.rs: scale_blit_rgba, blit_row*, blit_overlay, i420/rgba8 conversion
- kernel.rs: LayerSnapshot, CompositeWorkItem/Result, composite_frame
- mod.rs: CompositorNode, run loop, registration, tests

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): 5 high-impact video compositing optimizations

1. Pool intermediate color conversion buffers: i420_to_rgba8_buf and
   rgba8_to_i420_buf write into caller-provided buffers instead of
   allocating fresh Vec's every frame (~34 MB/s allocation churn eliminated).
   Persistent scratch buffers are reused across frames in the compositing thread.

2. I420 pass-through: when a single I420 layer fills the full canvas with
   no overlays and output is I420, skip the entire I420→RGBA8→I420 round-trip.

3. Vectorize inner loops: process 4 pixels at a time in color conversion
   loops with hoisted row bases to help LLVM auto-vectorize.

4. Arc overlays: wrap DecodedOverlay in Arc so per-frame clones into the
   CompositeWorkItem are cheap reference-count bumps instead of deep copies.

5. Integer-only alpha blending: replace f32 blend math in blit_row_opaque
   and blit_row_alpha with fixed-point integer arithmetic using the
   ((val + (val >> 8)) >> 8) fast approximation of division by 255.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
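Optimization 5 above relies on the classic fixed-point trick for dividing by 255. A sketch of the approximation and a single-channel alpha-over blend using it (the in-tree blit functions are assumed to operate on whole rows):

```rust
/// ((v + (v >> 8)) >> 8) stays within 1 of v / 255 over the full
/// 0..=255*255 product range -- close enough for 8-bit blending,
/// and much cheaper than an integer division or f32 math.
fn fast_div_255(v: u32) -> u32 {
    (v + (v >> 8)) >> 8
}

/// Alpha-over blend of one 8-bit channel, integer-only.
fn blend_channel(src: u8, dst: u8, alpha: u8) -> u8 {
    let a = u32::from(alpha);
    let v = u32::from(src) * a + u32::from(dst) * (255 - a);
    fast_div_255(v) as u8
}
```

Note the approximation can undershoot the exact quotient by 1, which is visually insignificant at 8 bits per channel.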

* style: apply cargo fmt formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): fix regression — replace broken chunking with slice iterators

The previous 4-pixel chunking approach (for chunk in 0..chunks { for i in 0..4 })
added MORE Range::next overhead instead of helping vectorization.

Fixes:
- i420_to_rgba8_buf: use chunks_exact_mut(4) on output + sub-sliced input
  planes to eliminate Range::next calls AND bounds checks entirely
- rgba8_to_i420_buf Y plane: use chunks_exact(4) on input RGBA row with
  enumerate() instead of range-based indexing
- I420 passthrough: return layer index instead of Arc, copy data into
  pooled buffer directly (Arc::try_unwrap always failed since the
  original frame still holds a ref, causing a wasteful .to_vec())

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): revert chunks_exact to simple for-loops

chunks_exact(4).enumerate() added MORE overhead than Range::next:
- ChunksExact::next -> split_at_checked -> split_at_unchecked -> from_raw_parts
  chain consumed ~33% CPU vs original ~14% from Range::next.
- Enumerate::next alone was 15.33% of total CPU.

Revert to simple 'for col in 0..w' with pre-computed row bases.
The buffer pooling (optimization #1) is confirmed working well
via DHAT: ~1GB alloc churn eliminated.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): eliminate double-copy in I420 output path

Write rgba8_to_i420_buf directly into the pooled output buffer instead
of going through an intermediate scratch buffer + copy_from_slice.
This removes a full extra memcpy of the I420 data every frame.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* bench: add compositor pipeline benchmark for profiling

Adds a standalone benchmark binary that runs the compositing oneshot
pipeline (colorbars → compositor → vp9 → webm → http_output) and
reports wall-clock time, throughput (fps), per-frame latency, and
output bytes.

Supports CLI args for profiling flexibility:
  --width, --height, --fps, --frames, --iterations

Usage: cargo bench -p streamkit-engine --bench compositor_pipeline
  cargo bench -p streamkit-engine --bench compositor_pipeline -- --frames 300 --width 1280 --height 720
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: resolve clippy lint errors in video nodes

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: resolve remaining clippy lint errors in video nodes

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: make lint pass after metadata updates

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* chore: update native plugin lockfiles

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(webm): skip intermediate flushes in File mode to prevent finalize failure

In File mode, the SharedPacketBuffer was being drained during the mux
loop via flush_output(). When segment.finalize() subsequently tried to
seek backward to backpatch the EBML header (duration, cues), those
bytes had already been moved out of the buffer, causing finalize to
fail.

Fix: guard flush_output calls with an is_file_mode flag so the entire
buffer remains intact until finalize() completes. The post-finalize
flush already handles emitting the complete finalized bytes.

Also adds libvpx-dev to the CI runner's apt packages (lint, test, build
jobs) so the vp9 feature compiles on GitHub Actions.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(webm): use Live mode for VP9 mux test to avoid unbounded memory

The previous fix kept the entire WebM buffer in memory during File mode
to allow finalize() backward seeks. This would cause unbounded memory
growth for long streams.

Instead, switch the test to Live mode (the default and intended
streaming use case). Live mode uses a non-seek writer with zero-copy
streaming drain, keeping memory bounded. The test assertions (EBML
header, content type) don't require File mode.

Reverts the is_file_mode flush guard from the previous commit since
it's no longer needed.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): handle non-video packets and single channel close in recv_from_any_slot

Introduces SlotRecvResult enum with Frame/ChannelClosed/NonVideo/Empty variants.
The main loop now removes closed slots and skips non-video packets instead of
treating any single channel close as all-inputs-closed.

Also adds a comment about dropped in-flight results on shutdown (Fix #6).

Optimizes overlay cloning by using Arc<[Arc<DecodedOverlay>]> instead of
Vec<Arc<DecodedOverlay>> so cloning into the work item each frame is a single
ref-count bump instead of a full Vec clone (Fix #8).

Fixes: #1, #6, #8
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(webm): restore streaming-mode guard in flush_output

Pass streaming_mode into flush_output and skip all intermediate flushes
in File mode. In File mode the writer supports seeking and may back-patch
segment sizes/cues, so draining the buffer after every frame would send
stale bytes that get overwritten later, corrupting the output.

Fix #2

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(moq): remove hardcoded catalog dimensions and add clean shutdown

Thread video_width and video_height from MoqPeerConfig through to
create_and_publish_catalog instead of hardcoding 640x480. Add fields
to BidirectionalTaskConfig so the bidirectional path also gets the
correct dimensions.

Add clean shutdown when both audio and video pipeline inputs close:
each input branch now explicitly handles None (channel closed), sets
its rx to None, and breaks when both are done.

Fixes #3, #4

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(vp9): improve encoder/decoder allocations and add shutdown comments

- Change next_pts duration default from 0 to 1 so libvpx rate-control
  always sees a non-zero duration (Fix #5).
- Add comment about data loss on explicit encoder shutdown (Fix #7).
- Use Bytes::copy_from_slice in drain_packets instead of .to_vec() +
  Bytes::from(), avoiding an intermediate Vec allocation per encoded
  packet (Fix #9).
- Use Vec::with_capacity(1) in decode_packet since most VP9 packets
  produce exactly one frame, avoiding a heap alloc in the common
  case (Fix #10).

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(video): extract shared parse_pixel_format utility

Move the duplicated parse_pixel_format function from colorbars.rs and
compositor/config.rs into video/mod.rs as a shared utility. Both modules
now re-export it from the parent module.

Also includes cargo fmt formatting fixes from the previous commits.

Fix #11

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: sweep bar clipping, WebM auto-detect dims, output filename

- colorbars: clip sweep bar at frame edge instead of wrapping via modulo,
  preventing the bar from appearing split across PiP boundaries
- webm: auto-detect video dimensions from first VP9 keyframe when
  video_width/video_height are not configured (both 0). Parses the VP9
  uncompressed header to extract width/height, buffers the first packet,
  and replays it after segment creation. This eliminates the need to
  manually keep muxer dimensions in sync with the upstream encoder.
- ui: change download filename from 'converted_audio_converted.webm' to
  'output.[ext]' when no source file is available; keep the
  '{name}_converted' pattern only when a real input file exists

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt to webm muxer

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf: collapse SharedPacketBuffer mutexes, bump pool max, zero-alloc compositor poll

- Collapse triple-mutex SharedPacketBuffer into single Mutex<BufferState>
  to eliminate lock-ordering risk between cursor, last_sent_pos, and
  base_offset.

- Bump DEFAULT_VIDEO_MAX_BUFFERS_PER_BUCKET from 8 to 16 to reduce pool
  misses in deep pipelines (colorbars → compositor → encoder → muxer →
  transport can easily have 8+ frames in flight).

- Replace select_all + Vec<Box<Pin<Future>>> in compositor
  recv_from_any_slot with zero-allocation poll_fn that calls poll_recv
  directly on each slot receiver.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(sample): add pacer node to video compositor demo for real-time playback

Without the pacer, colorbars in batch mode (frame_count > 0) generates
all frames as fast as possible with no real-time pacing. The WebM muxer
flushes each frame immediately in live mode, flooding the http_output
with the entire stream faster than real-time, causing browsers to buffer
heavily.

Insert core::pacer between webm_muxer and http_output to release muxed
chunks at the rate indicated by their duration_us metadata (~33ms per
frame at 30fps), matching real-time playback expectations.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
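The pacing idea above amounts to releasing chunk N only after the cumulative duration of the chunks before it. A sketch that computes the release schedule from duration_us metadata (the real pacer node's internals are assumed; actual sleeping is omitted):

```rust
/// For each chunk, the offset (in µs from stream start) at which it may be
/// released downstream: the running sum of all preceding chunk durations.
fn release_offsets_us(durations_us: &[u64]) -> Vec<u64> {
    let mut t = 0u64;
    durations_us
        .iter()
        .map(|d| {
            let release_at = t; // chunk N waits out the N-1 chunks before it
            t += d;
            release_at
        })
        .collect()
}
```

At 30 fps (~33,333 µs per frame) this yields releases at 0, ~33 ms, ~66 ms, ..., matching real-time playback instead of a burst.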

* fix(engine): walk connection graph backwards for content-type resolution

When passthrough-style nodes (core::pacer, core::passthrough,
core::telemetry_tap, etc.) are inserted between the content-producing
node and http_output, the oneshot runner previously only checked the
immediate predecessor of http_output for content_type(). Since those
utility nodes return None, the response fell back to
application/octet-stream, causing browsers to misdetect the stream.

Now the runner walks backwards through the connection graph until it
finds a node that declares a content_type, so inserting any number
of passthrough nodes before http_output preserves the correct MIME.

Also suppresses clippy::significant_drop_tightening on the
SharedPacketBuffer methods where the mutex guard intentionally spans
the entire take-trim-update / seek-compute sequence.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
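The backward walk described above can be sketched with a plain predecessor map; the graph representation and lookup names here are illustrative stand-ins for the oneshot runner's actual structures:

```rust
use std::collections::HashMap;

/// Walk upstream from `start` until some node declares a content type,
/// skipping passthrough nodes (pacer, telemetry_tap, ...) along the way.
fn resolve_content_type(
    start: &str,
    upstream: &HashMap<&str, &str>,      // node -> its immediate predecessor
    content_types: &HashMap<&str, &str>, // nodes that declare a MIME type
) -> Option<String> {
    let mut node = start;
    while let Some(&prev) = upstream.get(node) {
        if let Some(&ct) = content_types.get(prev) {
            return Some(ct.to_string());
        }
        node = prev; // keep walking past nodes that return None
    }
    None
}
```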

* fix(compositor): sort input slots by pin name for deterministic layer ordering

HashMap::drain() has non-deterministic iteration order, so the compositor
slots could randomly swap which input becomes the background (idx 0) vs.
the PiP overlay (idx > 0).  This caused two user-visible issues:

1. Background/PiP resolution swap: the 1280×720 colorbars sometimes
   ended up in the PiP slot and the 320×240 in the background slot.

2. Sweep bar appearing to extend beyond PiP boundaries: a consequence
   of the resolution swap — the large-resolution sweep bar interacts
   visually with the small-resolution background at the PiP boundary.

Fix: sort the drained inputs numerically by their 'in_N' pin suffix
before populating the slots Vec, so in_0 always comes before in_1.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
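The deterministic ordering fix above reduces to a numeric sort on the 'in_N' suffix. A sketch (treating a bare "in" pin as slot 0, consistent with the single-input convention elsewhere in this PR):

```rust
/// Numeric index of an 'in_N' pin; plain "in" (or anything
/// unparseable) sorts first as slot 0.
fn pin_index(pin: &str) -> u32 {
    pin.strip_prefix("in_")
        .and_then(|n| n.parse().ok())
        .unwrap_or(0)
}

/// Sort pins so in_0 always precedes in_1, in_2, ... in_10
/// (a lexicographic sort would put "in_10" before "in_2").
fn sort_pins(pins: &mut Vec<String>) {
    pins.sort_by_key(|p| pin_index(p));
}
```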

* feat(compositor): add z_index to LayerConfig for explicit layer stacking order

Adds a z_index field (i32, default 0) to LayerConfig and LayerSnapshot.
Layers are sorted by z_index before compositing — lower values are drawn
first (bottom of the stack).  Ties are broken by the original slot order.

Auto-PiP layers without explicit config get z_index = slot index (so
background = 0, first PiP = 1, etc.).  Explicit LayerConfig entries can
override this to reorder layers at will, including via UpdateParams at
runtime.

This decouples visual stacking order from pin connection order, which is
the correct separation of concerns for a compositor.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
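The tie-breaking rule above falls out naturally from Rust's sort guarantees: slice sorts are stable, so layers with equal z_index keep their original slot order. A sketch with an illustrative Layer type:

```rust
#[derive(Debug, PartialEq)]
struct Layer {
    slot: usize,
    z_index: i32,
}

/// Lower z_index is drawn first (bottom of the stack); sort_by_key is
/// stable, so equal z_index values preserve the original slot order.
fn stack_order(mut layers: Vec<Layer>) -> Vec<Layer> {
    layers.sort_by_key(|l| l.z_index);
    layers
}
```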

* style: apply cargo fmt to compositor z_index changes

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf: review fixes — temp file for WebM File mode, Arc unwrap, rayon threshold, saturating sub, config struct

- Fix #6: Use saturating_sub for MoQ Peer subscriber count to prevent underflow
- Fix #11: Skip memcpy in I420 passthrough when Arc has sole ownership (try_unwrap)
- Fix #12: Add minimum-row threshold for rayon parallel pixel ops (skip dispatch for small canvases)
- Fix #19: WebM File mode uses on-disk temp file (FileBackedBuffer) instead of unbounded in-memory Vec
- Fix #24: Group subscriber params into SubscriberMediaConfig struct, reducing argument counts
- Add MuxBuffer enum to unify Live (SharedPacketBuffer) and File (FileBackedBuffer) buffer types
- Add tempfile to webm feature gate in Cargo.toml
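The sole-ownership fast path from Fix #11 can be sketched with `Arc::try_unwrap` — an illustrative reduction; the real code operates on pooled frame buffers rather than a bare `Vec<u8>`:

```rust
use std::sync::Arc;

// If this Arc is the sole owner, move the buffer out with no memcpy;
// otherwise fall back to cloning the bytes.
fn take_or_copy(data: Arc<Vec<u8>>) -> Vec<u8> {
    Arc::try_unwrap(data).unwrap_or_else(|shared| (*shared).clone())
}
```

`try_unwrap` only succeeds when the strong count is exactly one, so the copy is skipped precisely when no other consumer still holds the frame.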

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): sweep_bar toggle, fontdue text rendering, rotation, signed coords

- Add sweep_bar bool to ColorBarsConfig (default true) to gate the
  animated vertical bar; set false on background to prevent visual
  bleed through PiP overlays.

- Replace placeholder rectangle glyphs with real font rendering via
  fontdue 0.9.  Supports font_path, font_data_base64, and falls back
  to system DejaVu Sans.  Coverage-based alpha-over compositing.

- Change Rect.x/y from u32 to i32 for signed (off-screen) positioning.
  scale_blit_rgba now clips negative source offsets correctly.

- Add rotation_degrees (f32, clockwise) to LayerConfig/LayerSnapshot.
  New scale_blit_rgba_rotated() uses inverse-affine mapping with
  nearest-neighbor sampling over the axis-aligned bounding box.

- Update oneshot demo YAML: sweep_bar false on background, explicit
  layer config with PiP rect at (380,220) 240x180 rotated 15 degrees.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(demo): add text overlay layer with bundled DejaVu Sans font

Add a third layer to the compositor demo: a 'StreamKit Demo' text
overlay rendered with fontdue using the bundled DejaVu Sans font.

- Bundle DejaVu Sans TTF in assets/fonts/ with its Bitstream Vera
  license file.
- Update demo YAML to include text_overlays with font_path pointing
  to the bundled font, white text at (20,20) 32px.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: work around serde_saphyr untagged enum limitation for nested YAML

serde_saphyr fails to deserialize deeply nested structures (sequences
of objects with nested objects, maps with nested objects) when they
appear inside #[serde(untagged)] enums.

Add parse_yaml() helper to streamkit_api::yaml that uses a two-step
approach: YAML -> serde_json::Value -> UserPipeline.  This bypasses
the serde_saphyr limitation by using serde_json's deserializer for the
untagged enum dispatch.

Update all three call sites that directly deserialized YAML into
UserPipeline:
  - samples.rs: parse_pipeline_metadata()
  - server.rs: create_session_handler()
  - server.rs: parse_config_field()

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt to server.rs

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(demo): move PiP overlay positioning to the left

Move the PiP overlay x-coordinate from 380 to 100 so the main canvas
blue bar (rightmost SMPTE bar) remains clearly visible and is not
obscured by the overlapping PiP layer.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(colorbars): remove sweep_bar parameter entirely

Remove the sweep_bar config field, its default function, and both
draw_sweep_bar_i420/draw_sweep_bar_rgba8 rendering functions.  Also
remove the sweep_bar: false reference from the compositor demo YAML.

The sweep bar feature is being dropped for now to simplify the generator.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): add draw_time option with millisecond precision

When draw_time is true the compositor renders the current wall-clock
time (HH:MM:SS.mmm) in the bottom-left corner of every composited
frame using a pre-loaded monospace font (DejaVu Sans Mono).

- Add draw_time and draw_time_font_path fields to CompositorConfig
- Add load_font_from_path() and rasterize_text_with_font() to overlay
- Pre-load font once during init; rasterize per frame in the main loop
- Pull DejaVu Sans Mono (royalty-free) into assets/fonts/
- Enable draw_time in the demo pipeline YAML

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt to draw_time changes

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): add edge anti-aliasing for rotated layers

Replace the hard binary contains() inside/outside test in
scale_blit_rgba_rotated() with a signed-distance-to-edge approach.

For each destination pixel the signed distance to each of the four
edges of the un-rotated rectangle is computed.  Pixels well inside
(dist >= 1) get full alpha; edge pixels (0 < dist < 1) get fractional
coverage proportional to the distance; pixels outside (dist <= 0) are
skipped.

This smooths the staircase zig-zag artifacts on rotated overlay
borders.  The bounding box is also expanded by 1px on each side to
include the anti-aliased fringe.
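The coverage rule reduces to a clamp of the minimum signed edge distance — an illustrative sketch of the per-pixel test, not the actual kernel:

```rust
// dist_to_edges: signed distance from the pixel to each of the four
// edges of the un-rotated rect (positive = inside).
fn edge_coverage(dist_to_edges: [f32; 4]) -> f32 {
    let min_dist = dist_to_edges.into_iter().fold(f32::INFINITY, f32::min);
    // >= 1 inside -> full alpha; 0..1 -> fractional; <= 0 -> skipped.
    min_dist.clamp(0.0, 1.0)
}
```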

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(colorbars): move draw_time from compositor to colorbars generator

The draw_time feature belongs in the source frame generator (ColorBarsNode),
not the composition layer, consistent with how sweep_bar was previously
implemented.

- Add draw_time + draw_time_font_path fields to ColorBarsConfig
- Implement per-frame wall-clock stamping (HH:MM:SS.mmm) in ColorBarsNode
  using fontdue, supporting both RGBA8 and I420 pixel formats
- Remove draw_time logic from CompositorConfig/CompositorNode entirely
- Remove unused load_font_from_path and rasterize_text_with_font from overlay
- Add fontdue dependency to the colorbars feature
- Update demo YAML to configure draw_time on colorbars_bg node

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor: deduplicate and improve video subsystem code quality

- Extract shared mux_frame() helper in webm.rs (~120 lines reduced)
- Extract generic codec_forward_loop() for VP9 encoder/decoder (~300 lines)
- Extract shared blit_text_rgba() utility in video/mod.rs
- Parallelize rotated blit with rayon (row-level, RAYON_ROW_THRESHOLD)
- Document packed layout assumption in pixel format conversions
- Share DEFAULT_VIDEO_FRAME_DURATION_US constant (webm + moq peer)
- Share accepted_video_types() in compositor (definition_pins + make_input_pin)

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(ui): add compositor node UI with draggable layer canvas

Add visual compositor node UI that allows users to manipulate
compositor layers on a scaled canvas. Features include:

- Draggable, resizable layer boxes with position/size handles
- Opacity, rotation, and z-index sliders per selected layer
- Zero-render drag via refs + requestAnimationFrame for smooth UX
- Full config updates via new tuneNodeConfig callback
- Staging mode support (batch changes or live updates)
- LIVE indicator matching AudioGainNode pattern

New files:
- useCompositorLayers.ts: Hook for layer state management
- CompositorCanvas.tsx: Visual canvas component
- CompositorNode.tsx: ReactFlow node component

Modified files:
- useSession.ts: Add tuneNodeConfig for full-config updates
- reactFlowDefaults.ts: Register compositor node type
- FlowCanvas.tsx: Add compositor to nodeTypes type
- MonitorView.tsx: Map video::compositor kind, thread onConfigChange
- DesignView.tsx: Map video::compositor kind with defaults

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): collapse unscaled height in compositor canvas via negative margin

CSS transform: scale() does not affect the layout box, causing
the outer container to reserve the full unscaled height (e.g. 720px).
Add marginBottom: canvasHeight * (scale - 1) to collapse the extra
space so the compositor node fits tightly in the ReactFlow canvas.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): map video::compositor type in YAML pipeline parser

The YAML parser hardcoded all non-gain nodes to 'configurable' type,
so compositor nodes imported via YAML would not get the custom
CompositorNode UI. Add the same kind-to-type mapping used in
DesignView and MonitorView.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): enable compositor layer interactions in Design View

- Wire up onParamChange in useCompositorLayers so layers are interactive
  when editing pipelines in Design View (not just live sessions)
- Trigger YAML regeneration on param changes with feedback loop guard
- Defer YAML regeneration via queueMicrotask to avoid React setState
  during render warning

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: format useCompositorLayers.ts

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat: add Video Compositor (MoQ Stream) pipeline template

Adds a sample dynamic pipeline that composites two colorbars sources
through the compositor node and streams the result via MoQ (WebTransport).

Pipeline chain: colorbars_bg + colorbars_pip → compositor (2 inputs) →
VP9 encoder → MoQ peer (output broadcast).

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(ui): Complete compositor UX improvements

- Fix YAML pipeline loading: infer compositor output_pixel_format (I420/Rgba8)
- Fix wildcard null matching in canConnectPair for dimension compatibility
- Fix map-style needs parsing in YAML pipeline loader ({pin: node} format)
- Replace Z-index slider with numeric input + bring forward/backward buttons
- Add text overlay management UI (add/remove with default params)
- Add image overlay management UI integrated with asset upload system
- Add collapsible Output Preview panel in Monitor View

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): prevent compositor node overlap in auto-layout

Add estimated height (500px) for video::compositor node kind to prevent
overlapping with downstream nodes during auto-layout positioning.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(ui): compositor UX improvements - layer rendering, floating preview, YAML highlighting

- Render text overlays with actual text content and scaled font in compositor canvas
- Render image overlays as distinct colored rectangles with icon badge
- Apply golden-angle hue spacing for visual layer distinction
- Add layer name overlay and dimension labels on each layer
- Add per-layer controls: opacity slider, rotation slider, z-index with stack buttons
- Replace title tooltips with SKTooltip in overlay remove buttons
- Add useCompositorSelection hook for cross-component layer selection sync
- Highlight selected compositor layer's YAML range in YamlPane
- Redesign output preview from bottom-docked panel to floating draggable window
- Style numeric inputs with design system tokens (borders, focus ring, hidden spinners)
- Fix ESLint import ordering and unused variable warnings

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor,vp9): eliminate format bounce and add SSE2 SIMD (#62)

* perf(compositor,vp9): eliminate format bounce and add SSE2 SIMD

- Compositor now always outputs RGBA8, removing the per-frame
  rgba8_to_i420_buf call from the compositing thread (~24% CPU).
- VP9 encoder accepts both RGBA8 and I420 inputs; when receiving
  RGBA8 it converts to I420 on its own blocking thread, pipelining
  the conversion with the compositor's next frame.
- Added SSE2 SIMD paths for i420_to_rgba8_buf and rgba8_to_i420_buf
  (Y-plane and chroma subsampling), processing 8 pixels per iteration
  with scalar fallback for tail pixels and non-x86 targets.
- Removed try_i420_passthrough optimisation (no longer needed since
  the compositor always works in RGBA8).
- Simplified CompositeResult to a single rgba_data field.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): fix i16 overflow in SIMD color conversions, use i32 arithmetic

Both i420_to_rgba8_row_sse2 and rgba8_to_y_row_sse2 now use 32-bit
arithmetic throughout to avoid silent truncation when BT.601
coefficients (298, 409, 516, 129) are multiplied by pixel values
(0-255).  The products can reach ~131,580, well beyond i16::MAX (32,767).

Changes:
- i420_to_rgba8_row_sse2: process 4 pixels/iter in i32 (was 8 in i16)
- rgba8_to_y_row_sse2: process 4 pixels/iter in i32 (was 8 in i16)
- New mul32_sse2 helper: SSE2-compatible i32 multiply via _mm_mul_epu32
  with even/odd lane shuffling
- Add 3 equivalence tests: SIMD-vs-scalar for both directions + roundtrip
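For reference, the scalar form of the same fixed-point BT.601 limited-range conversion, written entirely in i32. This is the standard formula the SIMD kernels mirror, not the project's exact code:

```rust
// BT.601 limited-range YUV -> RGB, 8-bit fixed point. All intermediates
// are i32 because e.g. 516 * 255 = 131_580 overflows i16::MAX (32_767).
fn yuv_to_rgb(y: u8, u: u8, v: u8) -> (u8, u8, u8) {
    let c = i32::from(y) - 16;
    let d = i32::from(u) - 128;
    let e = i32::from(v) - 128;
    let clamp = |x: i32| x.clamp(0, 255) as u8;
    let r = clamp((298 * c + 409 * e + 128) >> 8);
    let g = clamp((298 * c - 100 * d - 208 * e + 128) >> 8);
    let b = clamp((298 * c + 516 * d + 128) >> 8);
    (r, g, b)
}
```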

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): fix chroma averaging bug and remove stale output_pixel_format

- rgba8_to_chroma_row_sse2: simplified horizontal pair extraction to
  _mm_packs_epi32(r_sum, zero) instead of complex mask-shift-pack that
  dropped every other 2x2 chroma block (causing visible vertical banding)
- Removed stale output_pixel_format: i420 from video_compositor_demo.yml
  and the compositor benchmark (the field is now silently ignored since
  the compositor always outputs RGBA8)
- Removed unused imports (_mm_srli_si128, _mm_set_epi32) from chroma fn

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply cargo fmt to chroma averaging fix

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* feat: NV12 as default video format (#63)

* feat: add NV12 as default video format

- Add PixelFormat::Nv12 variant to core type system with VideoLayout
  plane math for 2-plane NV12 (Y + interleaved UV)
- Update parse_pixel_format to accept 'nv12' format string
- Change default pixel_format across nodes from 'i420' to 'nv12'
- VP9 decoder: output NV12 by interleaving libvpx's I420 U/V planes
- VP9 encoder: accept NV12 via VPX_IMG_FMT_NV12 (zero-conversion path)
- Compositor: add nv12_to_rgba8_buf conversion with SSE2 SIMD reuse
- Colorbars: add NV12 generation and time-stamp support
- Update test utilities for NV12 chroma initialization

NV12's interleaved UV plane is more cache-friendly for RGBA conversion
kernels, and the encoder can consume NV12 directly without format
conversion, making the single-layer passthrough path faster.
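The 2-plane NV12 math can be sketched as follows — an illustrative helper, not the real VideoLayout API:

```rust
// NV12: a full-resolution Y plane followed by one half-resolution plane
// of interleaved U/V byte pairs (field names are illustrative).
struct Nv12Layout {
    y_size: usize,
    uv_size: usize,
}

fn nv12_layout(width: usize, height: usize) -> Nv12Layout {
    let y_size = width * height;
    // Chroma is subsampled 2x2; U and V are interleaved, so the UV
    // plane holds 2 bytes per 2x2 block of luma pixels.
    let uv_size = width.div_ceil(2) * height.div_ceil(2) * 2;
    Nv12Layout { y_size, uv_size }
}
```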

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: validate chroma stride before cast, update decoder description to NV12

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf: use thread-local scratch buffers in nv12_to_rgba8_buf SIMD path

Replace per-row Vec allocations with thread_local! RefCell<Vec<u8>>
scratch buffers that are allocated once per thread and reused across
rows. Eliminates ~2×height heap allocations per frame (e.g. 2160
allocs/frame at 1080p) while preserving correctness under both
sequential and rayon parallel execution.
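The pattern described above looks roughly like this (names are illustrative; the real code keys the scratch by row width):

```rust
use std::cell::RefCell;

thread_local! {
    // One scratch buffer per worker thread, grown on demand and
    // reused across rows instead of allocating per row.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        if buf.len() < len {
            buf.resize(len, 0); // allocates at most a few times per thread
        }
        f(&mut buf[..len])
    })
}
```

Under rayon each worker thread gets its own buffer, so no locking is needed and the pattern is safe for both sequential and parallel dispatch.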

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(nodes): eliminate NV12↔RGBA8 conversion overhead in compositor pipeline (#65)

* perf(nodes): eliminate NV12↔RGBA8 conversion overhead in compositor pipeline

Two targeted fixes for the hot paths identified in CPU profiling:

1. nv12_to_rgba8_buf: Replace thread-local scratch buffer deinterleaving
   with a dedicated nv12_to_rgba8_row_sse2 kernel that reads NV12's
   interleaved UV plane directly.  Eliminates per-row RefCell borrow_mut
   and LocalKey::try_with overhead (~50% of profiled CPU time).

2. VP9 encoder: Convert RGBA8→NV12 instead of RGBA8→I420 so the encoder
   can feed VPX_IMG_FMT_NV12 to libvpx directly, matching the pipeline's
   native NV12 format and avoiding the I420 detour (~28% of profiled CPU).

Adds rgba8_to_nv12_buf() for the new output path.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(nodes): add SSE4.1 fast-path kernels for color-space conversion

Replace 7-instruction mul32_sse2 emulation with single-instruction
_mm_mullo_epi32 in three hot kernels identified by pprof (mul32_sse2
was 26.49% CPU):

- i420_to_rgba8_row_sse41: 6 native multiplies per pixel
- nv12_to_rgba8_row_sse41: 6 native multiplies per pixel
- rgba8_to_y_row_sse41: 3 native multiplies per pixel

All _buf callers now runtime-detect SSE4.1 and prefer it, falling back
to SSE2 on older hardware. Identical color-space math; no functional
change.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* docs: update VP9 encoder registration to mention NV12 input format

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
Co-authored-by: staging-devin-ai-integration[bot] <166158716+staging-devin-ai-integration[bot]@users.noreply.github.com>

* perf: enable thin LTO, codegen-units=1, and target-cpu=native for profiling (#66)

- Add lto = "thin" and codegen-units = 1 to [profile.release] in
  Cargo.toml for cross-crate inlining and maximum LLVM optimisation.
- Add -C target-cpu=native to build-skit-profiling and skit-profiling
  so CPU profiles reflect host-tuned codegen.
- Add new build-skit-native target for max-perf local builds tuned to
  the build host's microarchitecture.
- Docker/CI release builds remain portable (no target-cpu=native in
  Cargo.toml or .cargo/config.toml).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): implement findings 1+4, 2, 5, and 3 for video compositor optimizations (#67)

* perf(compositor): implement findings 1+4, 2, 5, and 3 for video compositor optimizations

- Finding 1+4: Incremental stepper + interior AA skip in scale_blit_rgba_rotated
  Replace per-pixel multiplies with adds by stepping local_x/local_y incrementally.
  When min_dist >= 2.0, batch interior pixels skipping coverage math entirely.

- Finding 2: NV12 interleaved-output SIMD chroma kernel (SSE2)
  New rgba8_to_chroma_row_nv12_sse2 with interleaved U/V store via _mm_unpacklo_epi8.
  Wired into rgba8_to_nv12_buf conversion path.

- Finding 5: Rayon row chunking (8-row blocks)
  Replace per-row rayon tasks with 8-row chunks across all dispatch sites
  (rotated blit, i420/nv12 conversions) to reduce scheduling overhead.

- Finding 3: AVX2 Y-plane kernel (8 pixels/iter)
  New rgba8_to_y_row_avx2 using 256-bit registers, wired with AVX2 > SSE4.1 > SSE2
  priority in both I420 and NV12 Y-plane conversion paths.
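Finding 1+4's incremental stepper can be illustrated in scalar form: stepping one destination pixel to the right adds a constant vector in layer-local space, so the two per-pixel multiplies disappear. A hypothetical helper, simplified to the rotation part only:

```rust
// Inverse-rotate the row start once, then walk the row with adds.
fn inverse_map_row(x0: f32, y0: f32, cos_t: f32, sin_t: f32, n: usize) -> Vec<(f32, f32)> {
    let mut lx = x0 * cos_t + y0 * sin_t;
    let mut ly = -x0 * sin_t + y0 * cos_t;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        out.push((lx, ly));
        // +1 in dst_x maps to a constant step in local space.
        lx += cos_t;
        ly -= sin_t;
    }
    out
}
```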

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): use copy_nonoverlapping instead of _mm_storeu_si128 in NV12 chroma kernel

_mm_storeu_si128 writes 16 bytes but only 8 are valid (4 UV pairs),
causing out-of-bounds writes on the last chroma row. Use
copy_nonoverlapping with explicit 8-byte length, matching the I420
chroma kernel's store pattern.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): bound dst_region slice and add rationale comments for cast suppressions

- Bound dst_region to bb_rows * row_stride to avoid dispatching rayon
  tasks beyond the bounding box rows.
- Add explanatory comments for #[allow(clippy::cast_possible_wrap)]
  per AGENTS.md linting discipline requirements.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): early-out when bounding box is empty (off-screen rect)

When a rotated layer is entirely off-screen, bb_y1 < bb_y0 or
bb_x1 < bb_x0. The subtraction (bb_y1 - bb_y0) as usize would wrap
to a huge value, causing a panic on the bounded dst_region slice.
Add an early return guard before the subtraction.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* style(compositor): fix clippy and rustfmt lint issues in SIMD kernels

- Remove empty line between doc comment blocks for rayon_chunk_rows
- Replace manual div_ceil with .div_ceil() method
- Apply rustfmt formatting to AVX2 import blocks and comments

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): cache available_parallelism in LazyLock for rayon_chunk_rows

available_parallelism() issues a sysconf(_SC_NPROCESSORS_ONLN) syscall
on every call (~40µs on Linux). Cache the result in a static LazyLock
so subsequent calls are a simple atomic load (~0.7ns).
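The caching pattern is the standard `LazyLock` idiom — a sketch; the real chunking policy may differ:

```rust
use std::sync::LazyLock;
use std::thread;

// Query the logical CPU count exactly once; later reads are a cheap
// atomic load instead of a sysconf syscall.
static NUM_CPUS: LazyLock<usize> =
    LazyLock::new(|| thread::available_parallelism().map_or(1, |n| n.get()));

// Illustrative policy: split rows evenly across the cached CPU count.
fn rows_per_chunk(total_rows: usize) -> usize {
    total_rows.div_ceil(*NUM_CPUS).max(1)
}
```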

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style(compositor): apply rustfmt to LazyLock closure

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): correct AVX2 lane-crossing in chroma kernels

_mm256_packs_epi32 operates per 128-bit lane, so packing two different
source registers (r_v_a, r_v_b) scrambles the element order — qwords 1
and 2 are swapped.  This caused chroma samples to be spatially
displaced, producing visible horizontal tearing artifacts on composited
overlays.

Fix: apply _mm256_permute4x64_epi64(result, 0xD8) (vpermq) immediately
after each cross-source pack to restore sequential element ordering.
Both rgba8_to_chroma_row_nv12_avx2 and rgba8_to_chroma_row_avx2 are
fixed (3 permutes each — one per R, G, B channel).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(colorbars): default output pixel format to RGBA8

RGBA8 is more convenient and efficient for compositing workflows since
the compositor operates in RGBA8 internally — no format conversion
needed.

Pipelines that feed colorbars directly into VP9 (without a compositor)
now specify pixel_format: nv12 explicitly to avoid an unnecessary
RGBA8→NV12 conversion inside the encoder.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): add AVX2 NV12→RGBA8 kernel and hoist CPU feature detection

- Implement nv12_to_rgba8_row_avx2: processes 8 pixels per iteration
  (double SSE4.1 throughput) using 256-bit i32 arithmetic with drop-to-SSE
  pack/interleave to avoid lane-crossing issues
- Wire AVX2 kernel into nv12_to_rgba8_buf with SSE4.1 tail handling
- Hoist is_x86_feature_detected!() calls outside per-row closures in all
  4 conversion functions (i420_to_rgba8_buf, nv12_to_rgba8_buf,
  rgba8_to_i420_buf, rgba8_to_nv12_buf) to detect once at function start
  and capture in variables

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): algorithmic optimizations, SSE2 blend + microbenchmark (#68)

* perf(compositor): add compositor-only microbenchmark

Adds a standalone benchmark that measures composite_frame() in isolation
(no VP9 encode, no mux, no async runtime overhead).

Scenarios:
- 1/2/4 layers RGBA
- Mixed I420+RGBA and NV12+RGBA (measures conversion overhead)
- Rotation (measures rotated blit path)
- Static layers (same Arc each frame, for future cache-hit measurement)

Runs at 640x480, 1280x720, 1920x1080 by default.

Baseline results on this VM (8 logical CPUs):
  1920x1080 1-layer-rgba:          ~728 fps (1.37 ms/frame)
  1920x1080 2-layer-rgba-pip:      ~601 fps (1.66 ms/frame)
  1920x1080 2-layer-i420+rgba:     ~427 fps (2.34 ms/frame)
  1920x1080 2-layer-nv12+rgba:     ~478 fps (2.09 ms/frame)
  1920x1080 2-layer-rgba-rotated:  ~470 fps (2.13 ms/frame)

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply rustfmt to compositor_only benchmark

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): cache YUV→RGBA conversions + skip canvas clear

Optimization 1: Add ConversionCache that tracks Arc pointer identity
per layer slot. When the source Arc<PooledVideoData> hasn't changed
between frames, the cached RGBA data is reused (zero conversion cost).
Replaces the old i420_scratch buffer approach.

Optimization 2: Skip buf.fill(0) canvas clear when the first visible
layer is opaque, unrotated, and fully covers the canvas dimensions.
Saves one full-canvas memset per frame in the common case.
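Optimization 1's pointer-identity check can be sketched like this — the types and the `convert` callback are illustrative stand-ins for the real conversion routines:

```rust
use std::sync::Arc;

// Per-slot cache keyed by the Arc's pointer identity.
struct SlotCache {
    src_ptr: *const u8,
    rgba: Vec<u8>,
}

fn get_or_convert(
    cache: &mut Option<SlotCache>,
    src: &Arc<Vec<u8>>,
    convert: impl FnOnce(&[u8]) -> Vec<u8>,
) -> Vec<u8> {
    let ptr = Arc::as_ptr(src).cast::<u8>();
    if let Some(c) = cache {
        if c.src_ptr == ptr {
            return c.rgba.clone(); // same Arc as last frame: skip conversion
        }
    }
    let rgba = convert(src);
    *cache = Some(SlotCache { src_ptr: ptr, rgba: rgba.clone() });
    rgba
}
```

Comparing `Arc::as_ptr` is cheap and exact: a new frame always arrives in a new allocation, so a pointer match means the pixel data is unchanged.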

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): precompute x-map to eliminate per-pixel division

Optimization 3: Replace per-pixel `(dx + src_col_skip) * sw / rw`
integer division in blit_row_opaque/blit_row_alpha with a single
precomputed lookup table (x_map) built once per scale_blit_rgba call.

Each destination column now does a table lookup instead of a division,
removing O(width * height) divisions per layer per frame.
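A scalar sketch of the x-map idea (hypothetical helpers; the real blit also handles opacity and clipping):

```rust
// One division per destination column, built once per blit call.
fn build_x_map(src_w: usize, dst_w: usize) -> Vec<usize> {
    (0..dst_w)
        .map(|dx| (dx * src_w / dst_w).min(src_w - 1))
        .collect()
}

// The inner loop is now a table lookup per pixel, no division.
fn blit_row(dst: &mut [u32], src: &[u32], x_map: &[usize]) {
    for (dx, &sx) in x_map.iter().enumerate() {
        dst[dx] = src[sx]; // nearest-neighbor horizontal scale
    }
}
```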

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): add identity-scale fast path for 1:1 opaque blits

Optimization 4: When source dimensions match the destination rect,
opacity is 1.0, and there's no clipping offset, bypass the x-map
lookup entirely. For fully-opaque source rows, use bulk memcpy
(copy_from_slice). For rows with semi-transparent pixels, use a
simplified per-pixel blend without the scaling indirection.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): pre-scale image overlays at decode time

Optimization 5: When a decoded image overlay's native dimensions differ
from its target rect, pre-scale it once using nearest-neighbor at
config/update time. This ensures the per-frame blit_overlay call hits
the identity-scale fast path (memcpy) instead of re-scaling every frame.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): cache layer configs and skip per-frame sort

Optimization 6: Extract per-slot layer config resolution and z-order
sorting into a rebuild_layer_cache() function that runs only when
config or pin set changes (UpdateParams, pin add/remove, channel close).

Per-frame layer building now uses the cached resolved configs and
pre-sorted draw order instead of doing HashMap lookups and sort_by
on every frame.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(frame_pool): preallocate video pool buckets at startup

Optimization 7: Change video_default() from with_buckets (lazy, no
preallocation) to preallocated_with_max with 2 buffers per bucket.
This avoids cold-start allocation misses for the first few frames,
matching the existing audio_default() pattern.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style(compositor): fix clippy warnings from optimization changes

- Use map_or instead of match/if-let-else in ConversionCache and
  first_layer_covers_canvas
- Allow expect_used with safety comment in get_or_convert
- Allow dead_code on LayerSnapshot::z_index (sorting moved upstream)
- Allow needless_range_loop in blit_row_opaque/blit_row_alpha (dx used
  for both x_map index and dst offset)
- Allow cast_possible_truncation on idx as i32 in rebuild_layer_cache

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor): address correctness + bench issues from review

- Fix #1 (High): skip-clear now validates source pixel alpha (all pixels
  must have alpha==255) before skipping canvas clear. Prevents blending
  against stale pooled buffer data when RGBA source has transparency.

- Fix #2 (Medium): conversion cache slot indices now use position in the
  full layers slice (with None holes) via two-pass resolution, so cache
  keys stay stable when slots gain/lose frames.

- Fix #3 (Medium): benchmark now calls real composite_frame() kernel
  instead of reimplementing compositing inline. Exercises all kernel
  optimizations (cache, clear-skip, identity fast-path, x-map).

- Fix Devin Review: revert video pool preallocation (was allocating
  ~121MB across all bucket sizes at startup). Restored lazy allocation.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply rustfmt to fix formatting

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* perf(compositor): SSE2 blend, alpha-scan cache, bench pool, lazy prealloc

Fix 4 remaining performance findings:

1. High: Add SSE2 SIMD fast path for RGBA blend loops (blit_row_opaque,
   blit_row_alpha). Processes 4 pixels at a time with fast-paths for
   fully-opaque (direct copy) and fully-transparent (skip) source pixels.

2. Medium: Optimize alpha scan in clear-skip check — skip scan entirely
   for I420/NV12 layers (always alpha=255 after conversion), cache scan
   result by Arc pointer identity for RGBA layers.

3. Medium: Pass VideoFramePool to bench_composite instead of None, so
   benchmark exercises pool reuse like production.

4. Low-Medium: Lazy preallocate on first bucket use — when a bucket is
   first hit, allocate one extra buffer so the second get() is a hit.

Also: inline clear-skip logic to fix borrow checker conflict, remove
unused first_layer_covers_canvas function, add clippy suppression
rationale comments for needless_range_loop.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* feat(compositor-ui): UX improvements for video compositor (#69)

* feat(compositor-ui): UX improvements for video compositor

- Fix preview panel drag bug (inverted Y-axis)
- Fix text/image overlay dragging (extend drag to all layer types)
- Add visibility toggle (eye icon) to all layer types
- Unified layer list showing all layers sorted by z-index
- Visibility-aware canvas rendering (hidden layers show faintly)
- Conditional preview panel (only shows when there's something to preview)
- Fullscreen toggle for preview panel
- Preview activation button in Monitor view top bar (watch-only MoQ)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): address 5 UX issues from testing feedback

1. Fix rotation stretching: add transform-origin: center center to LayerBox
2. In-place text editing: double-click text overlay to edit inline on canvas
   - Disable resize handles for text layers (size controlled by font-size)
3. Fix overlay removal caching: add timestamp guard to prevent stale params
   from overwriting local overlay changes during sync
4. Consolidate overlays into unified layers: merge overlay add/remove/edit
   controls into UnifiedLayerList, remove separate OverlayList from render
5. Resizable preview panel: add left/top edge drag handles to resize panel

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): remove text layer padding and use indexed labels

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): address review bot findings (escape cancel, visibility sync, memo deps)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): guard double-commit on Enter and preserve overlay visibility on re-sync

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): preserve video layer opacity on visibility re-sync

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): clear selection on overlay removal to prevent stale selectedLayerId

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): use committedRef to prevent double-fire on Enter+blur in text edit (#71)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* fix(video): preserve aspect ratio in compositor rotation and stream rendering (#70)

* feat(nodes): preserve aspect ratio in rotated compositor layers

Replace the stretch-to-fill mapping in scale_blit_rgba_rotated with a
uniform-scale fit (object-fit: contain).  When a rotated layer's source
aspect ratio differs from the destination rect the image is now centred
with transparent padding instead of being distorted.

- Compute fit_scale = min(rw/sw, rh/sh) for uniform scaling
- Use content-local half-widths (half_cw, half_ch) for the bounding box
  and edge anti-aliasing distances
- Map content coords → source pixels via inv_fit_scale instead of
  normalising through the full rect dimensions
- Add test_rotated_blit_preserves_aspect_ratio unit test
- Update sample pipeline comment to document the behaviour

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): account for rotation angle in compositor fit scale

The previous fit scale only considered the source-to-rect aspect ratio
mismatch, which had no effect when both shared the same ratio (e.g. 4:3
source in a 4:3 rect).  The real issue is that a rotated rectangle's
axis-aligned bounding box is larger than the original, so the content
must be scaled down to fit within the rect after rotation.

New formula:
  rotated_bb_w = src_w·|cos θ| + src_h·|sin θ|
  rotated_bb_h = src_w·|sin θ| + src_h·|cos θ|
  fit_scale = min(rect_w / rotated_bb_w, rect_h / rotated_bb_h)

This ensures the rotated content fits entirely within the destination
rect with transparent padding, producing a natural-looking rotation
regardless of aspect ratio match.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
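
The formula above is simple enough to check in isolation. A minimal sketch, assuming a standalone function (the name and signature are illustrative, not the compositor's actual API):

```rust
// Illustrative sketch of the fit-scale computation described in the commit
// message; rotation_fit_scale is a hypothetical name, not compositor code.
fn rotation_fit_scale(src_w: f32, src_h: f32, rect_w: f32, rect_h: f32, theta_rad: f32) -> f32 {
    let (sin_t, cos_t) = (theta_rad.sin().abs(), theta_rad.cos().abs());
    // Axis-aligned bounding box of the rotated source rectangle.
    let bb_w = src_w * cos_t + src_h * sin_t;
    let bb_h = src_w * sin_t + src_h * cos_t;
    // Uniform scale so the rotated content fits entirely within the rect.
    (rect_w / bb_w).min(rect_h / bb_h)
}

fn main() {
    // At 0° the bounding box equals the source, so a matching rect gives scale 1.
    assert!((rotation_fit_scale(640.0, 480.0, 640.0, 480.0, 0.0) - 1.0).abs() < 1e-6);
    // A square rotated 45° must shrink by 1/sqrt(2) to stay inside the same rect.
    let s = rotation_fit_scale(100.0, 100.0, 100.0, 100.0, std::f32::consts::FRAC_PI_4);
    assert!((s - std::f32::consts::FRAC_1_SQRT_2).abs() < 1e-4);
}
```

Note the 45° square case: the bounding box grows by √2, so the content shrinks by the same factor even though the source and rect aspect ratios match exactly — which is why the earlier ratio-only fit scale had no effect.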

* fix(ui): derive canvas aspect ratio from stream dimensions

Replace hardcoded aspectRatio CSS values ('4 / 3' in StreamView,
'16 / 9' in OutputPreviewPanel) with a dynamic value observed from
the canvas element's width/height attributes.

The new useCanvasAspectRatio hook uses a MutationObserver to track
attribute changes made by the Hang video renderer, ensuring the
displayed aspect ratio always matches the actual video stream.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): use auto width on stream canvas to prevent stretching

When the container is wider than what the aspect ratio allows at
maxHeight 480px, width: 100% caused the canvas to stretch horizontally.
Changed to width: auto + max-width: 100% so the browser computes the
width from the aspect ratio and height constraint, then centers the
canvas with margin: 0 auto.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ui): skip default canvas dimensions in aspect ratio hook

Check canvas.getAttribute('width'/'height') before reading the
.width/.height properties. A newly-created canvas has default
intrinsic dimensions of 300x150, which would be reported as a
valid 2:1 ratio, causing a layout shift before the first video
frame arrives. Now the hook returns undefined until the Hang
renderer explicitly sets the canvas attributes.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): unify 0° fast path to use aspect-ratio-preserving fit

The near-zero rotation fast path now computes a fitted sub-rect
(uniform scale + centering) before delegating to scale_blit_rgba,
matching the rotated path's aspect-ratio-preserving behaviour.

This eliminates the behavioural discontinuity where 0° rotation
would stretch-to-fill while any non-zero rotation would letterbox.
Animating rotation through 0° no longer causes a visual pop.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
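
The "fitted sub-rect" step for the 0° path can be sketched as below — uniform scale, then centring, with the leftover area as transparent padding. Names are illustrative, not the node's actual code:

```rust
// Hypothetical sketch of the fitted sub-rect computed before delegating to
// scale_blit_rgba: returns (x, y, w, h) of the content within the dest rect.
fn fit_rect(src_w: u32, src_h: u32, rect_w: u32, rect_h: u32) -> (u32, u32, u32, u32) {
    // object-fit: contain — the smaller of the two axis scales wins.
    let scale = (rect_w as f32 / src_w as f32).min(rect_h as f32 / src_h as f32);
    let fit_w = (src_w as f32 * scale).round() as u32;
    let fit_h = (src_h as f32 * scale).round() as u32;
    // Centre the fitted content; the remainder is transparent padding.
    ((rect_w - fit_w) / 2, (rect_h - fit_h) / 2, fit_w, fit_h)
}

fn main() {
    // A 4:3 source in a 16:9 rect is pillarboxed: centred, full height.
    assert_eq!(fit_rect(640, 480, 1920, 1080), (240, 0, 1440, 1080));
    // Matching ratios fill the rect exactly — no visual pop at 0°.
    assert_eq!(fit_rect(100, 100, 100, 100), (0, 0, 100, 100));
}
```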

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* fix(compositor-ui): address 7 UX issues in compositor node (#72)

* fix(compositor-ui): address 7 UX issues in compositor node

Issue #1: Click outside text layer commits inline edit
- Add document.activeElement.blur() in handlePaneClick before deselecting
- Add useEffect on TextOverlayLayer watching isSelected to commit on deselect

Issue #2: Preview panel resizable from all four edges
- Add ResizeEdgeRight and ResizeEdgeBottom styled components
- Extend handleResizeStart edge type to support right/bottom
- Update resizeRef type to match

Issue #3: Monitor view preview extracts MoQ peer settings from pipeline
- Find transport::moq::peer node in pipeline and extract gateway_path/output_broadcast
- Set correct serverUrl and outputBroadcast before connecting
- Import updateUrlPath utility

Issue #4: Deep-compare layer state to prevent position jumps on selection change
- Skip setLayers/setTextOverlays/setImageOverlays when merged state is structurally equal
- Prevents stale server-echoed values from causing visual glitches

Issue #5: Rotate mouse delta for rotated layer resize handles
- Transform (dx, dy) by -rotationDegrees in computeUpdatedLayer
- Makes resize handles behave naturally regardless of layer rotation

Issue #6: Visual separator between layer list and per-layer controls
- Add borderTop and paddingTop to LayerInfoRow for both video and text controls

Issue #7: Text layers support opacity and rotation sliders
- Add rotationDegrees field to TextOverlayState, parse/serialize rotation_degrees
- Add rotation transform to TextOverlayLayer canvas rendering
- Replace numeric opacity input with slider matching video layer controls
- Add rotation slider for text layers

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
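
The issue #5 transform (rotate the mouse delta by -rotationDegrees) is plain 2D rotation. The real fix lives in TypeScript in computeUpdatedLayer; this is a hedged Rust rendition of the same math:

```rust
// Rotate a screen-space drag delta into a layer's local coordinate space by
// applying the inverse (negative) of the layer's rotation. Illustrative only;
// the production code is TypeScript.
fn rotate_delta(dx: f32, dy: f32, rotation_degrees: f32) -> (f32, f32) {
    let theta = (-rotation_degrees).to_radians();
    (
        dx * theta.cos() - dy * theta.sin(),
        dx * theta.sin() + dy * theta.cos(),
    )
}

fn main() {
    // With a layer rotated 90°, dragging right on screen becomes a drag along
    // the layer's local -y axis (screen y points down), so its resize handles
    // still track the edge under the cursor.
    let (lx, ly) = rotate_delta(10.0, 0.0, 90.0);
    assert!(lx.abs() < 1e-4 && (ly + 10.0).abs() < 1e-4);
}
```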

* fix(compositor-ui): fix preview drag, text state flicker, overlay throttling, multiline text

- OutputPreviewPanel: make panel body draggable (not just header) with
  cursor: grab styling so preview behaves like other canvas nodes
- useCompositorLayers: add throttledOverlayCommit for text/image overlay
  updates (sliders, etc.) to prevent flooding the server on every tick;
  increase overlay commit guard from 1.5s to 3s to prevent stale params
  from overwriting local state; arm guard immediately in updateTextOverlay
  and updateImageOverlay
- CompositorCanvas: change InlineTextInput from <input> to <textarea> for
  multiline text editing; Enter inserts newline, Ctrl/Cmd+Enter commits;
  add white-space: pre-wrap and word-break to text content rendering;
  add ResizeHandles to TextOverlayLayer when selected
- CompositorNode: change OverlayTextInput to <textarea> with vertical
  resize support for multiline text in node controls panel

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* feat(compositor): consolidate overlay transforms + unified z-sorted blit loop

Backend consolidation:
- Add OverlayTransform struct with #[serde(flatten)] for wire-compatible
  common spatial/visual properties (rect, opacity, rotation_degrees, z_index)
- Add rotation_degrees and z_index fields to DecodedOverlay
- Replace three separate blit loops (video, image, text) with a single
  z-sorted BlitItem loop, enabling interleaved layer ordering
- Remove dead blit_overlay() function (replaced by unified path)
- Add SSE2 batched blending for rotated blit interi…
staging-devin-ai-integration bot pushed a commit that referenced this pull request Mar 26, 2026
Critical fixes:
- Exclusive routing: dynamic channel OR static output, never both (fix #1)
- RwLock poison logged as error instead of silently swallowed (fix #2)

Improvements:
- Spawned input-forwarding task uses tokio::select! with shutdown_rx (fix #3)
- validate_connection_types logs at warn for dynamic pin skip (fix #4)
- Document poll_fn starvation bias as accepted trade-off (fix #5)
- Remove unused channels parameter from handle_pin_management (fix #6)

Nits:
- Update DynamicOutputs doc comment, remove stale legacy reference (fix #7)
- Use Arc short form (already imported) (fix #8)
- Improve test to exercise MoqPeerNode::new + output_pins + make_dynamic_output_pin (fix #9)

Also refactored handle_pin_management and process_frame_from_group to
reduce cognitive complexity below the 50-point lint threshold by
extracting route_packet, spawn_dynamic_input_forwarder,
insert_dynamic_output, remove_dynamic_output, and make_dynamic_input_pin
helper methods.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
streamer45 added a commit that referenced this pull request Mar 26, 2026
* feat(transport): add dynamic pin support to moq_peer and moq_push

Generalize MoQ transport nodes to discover and create tracks/pins
dynamically from catalogs instead of hardcoding audio+video pairs.

moq_peer changes:
- Set supports_dynamic_pins() to true
- Thread DynamicOutputs (Arc<RwLock<HashMap>>) through the publisher
  call chain: run -> start_publisher_task_with_permit ->
  publisher_receive_loop -> watch_catalog_and_process ->
  spawn_track_processor -> process_publisher_frames ->
  process_frame_from_group
- In watch_catalog_and_process, build track-named dynamic pin names
  (e.g. audio/data, video/hd) from catalog entries
- In process_frame_from_group, send frames to both the dynamic
  (track-named) output pin and the legacy pin for backward compat
- Handle all PinManagementMessage variants in handle_pin_management
- Accept both EncodedAudio(Opus) and EncodedVideo(VP9) on both
  input pins (in/in_1) for flexible media routing

moq_push changes:
- Set supports_dynamic_pins() to true
- Accept both EncodedAudio(Opus) and EncodedVideo(VP9) on both
  input pins (in/in_1)
- Handle dynamic input pin creation via PinManagementMessage,
  mapping each new pin to a corresponding MoQ track
- Add pin management select branch in the run loop

Engine changes (dynamic_actor.rs):
- In validate_connection_types, skip strict type validation for
  source pins on nodes that support dynamic pins
- In connect_nodes, create output pins on-demand via
  RequestAddOutputPin -> AddedOutputPin flow when the pin
  distributor doesn't exist but the node supports dynamic pins

All existing pipeline YAML files continue to work unchanged.
Legacy out/out_1 and in/in_1 pins remain as stable fallbacks.

Refs: #197

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): address review feedback on dynamic pin support

- Poll dynamic input receivers in moq_push select loop using poll_fn
- Determine is_video from pin name prefix convention instead of accepts_types
- Forward dynamic input pin packets in moq_peer instead of dropping channel
- Use DynamicInputState struct instead of tuple for type clarity

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* test(transport): add regression tests for dynamic pin fixes

- Test make_dynamic_output_pin produces correct types for video/audio/bare names
- Test AddedInputPin channel is not dropped (regression for channel discard bug)
- Test is_video determination uses pin name prefix convention
- Test track name derivation from pin names

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): fix double-prefixed pin names and shutdown cleanup

- Use catalog track names directly (already prefixed) instead of
  re-prefixing with audio/ or video/, which caused double-prefixed
  names like 'audio/audio/data'
- Finish dynamic input track producers on MoqPushNode shutdown
- Add regression test for double-prefix bug

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): finish track producers on remove, add stats to dynamic input forwarding

- RemoveInputPin now calls finish() on track producers before dropping
- Dynamic input forwarding tasks in moq_peer report received/sent stats
  via stats_delta_tx, matching the static pin handler pattern

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(transport): remove legacy pin names, use track-named pins exclusively

BREAKING CHANGE: moq_peer output pins renamed from out/out_1 to
audio/data and video/data to match catalog track names. Removes
audio_output_pin/video_output_pin parameters from the entire publisher
call chain (start_publisher_task_with_permit, publisher_receive_loop,
watch_catalog_and_process, spawn_track_processor, process_publisher_frames,
process_frame_from_group). Unifies output_pin and dynamic_pin_name into
a single track-name-based output pin. Updates all sample pipeline YAML
files to reference the new pin names.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: update remaining out_1 references in samples, e2e fixtures, tests, and docs

Updates missed references to the old moq_peer out/out_1 pin names:
- samples/pipelines/dynamic/video_moq_webcam_pip.yml
- samples/pipelines/dynamic/video_moq_screen_share.yml
- e2e/fixtures/webcam-pip.yaml
- e2e/fixtures/webcam-pip-cropped.yaml
- e2e/fixtures/webcam-pip-circle.yaml
- crates/api/src/yaml.rs (parser tests)
- docs/src/content/docs/guides/creating-pipelines.md

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(transport): address all 9 review items on dynamic pin support

Critical fixes:
- Exclusive routing: dynamic channel OR static output, never both (fix #1)
- RwLock poison logged as error instead of silently swallowed (fix #2)

Improvements:
- Spawned input-forwarding task uses tokio::select! with shutdown_rx (fix #3)
- validate_connection_types logs at warn for dynamic pin skip (fix #4)
- Document poll_fn starvation bias as accepted trade-off (fix #5)
- Remove unused channels parameter from handle_pin_management (fix #6)

Nits:
- Update DynamicOutputs doc comment, remove stale legacy reference (fix #7)
- Use Arc short form (already imported) (fix #8)
- Improve test to exercise MoqPeerNode::new + output_pins + make_dynamic_output_pin (fix #9)

Also refactored handle_pin_management and process_frame_from_group to
reduce cognitive complexity below the 50-point lint threshold by
extracting route_packet, spawn_dynamic_input_forwarder,
insert_dynamic_output, remove_dynamic_output, and make_dynamic_input_pin
helper methods.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): eliminate TOCTOU race in route_packet

Hold a single read lock for both the existence check and the send in
route_packet, preventing a concurrent RemoveOutputPin from removing the
entry between two separate lock acquisitions which would silently drop
the packet.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): handle closed dynamic output channels in route_packet

Distinguish try_send results: Ok and Full return true (packet sent or
acceptable frame drop for real-time media), Closed returns false to
trigger shutdown — matching the static output path behaviour.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
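
The three-way try_send handling can be sketched with std's bounded channel, whose TrySendError has the same shape as tokio's (Disconnected plays the role of Closed). A hedged sketch, not the node's actual routing code:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Returns true when the processor should keep running after routing a packet.
fn route<T>(tx: &SyncSender<T>, packet: T) -> bool {
    match tx.try_send(packet) {
        Ok(()) => true,
        // Full: acceptable frame drop for real-time media — keep going.
        Err(TrySendError::Full(_)) => true,
        // Disconnected (tokio: Closed): downstream gone — trigger shutdown.
        Err(TrySendError::Disconnected(_)) => false,
    }
}

fn main() {
    let (tx, rx) = sync_channel::<u32>(1);
    assert!(route(&tx, 1)); // sent
    assert!(route(&tx, 2)); // buffer full: frame dropped, still running
    drop(rx);
    assert!(!route(&tx, 3)); // receiver gone: shut down
}
```

(The follow-up commit below later softens the Closed case for dynamic outputs to remove-and-continue instead of shutting the whole processor down.)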

* fix(transport): keep track processor alive on closed dynamic channel

A closed dynamic output channel (downstream consumer disconnected)
now removes the stale entry and continues instead of triggering
FrameResult::Shutdown. This prevents a single consumer disconnect
from killing the entire track processor.

Also extract track_name_from_pin() and is_video_pin() into named
functions in push.rs so tests exercise the real production code
instead of duplicating the logic inline.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): keep dynamic input forwarder alive when no subscribers

Match the static input path behaviour: discard frames with
`let _ = tx.send(frame)` instead of breaking out of the loop when
there are no active broadcast receivers. This prevents the dynamic
input forwarder from permanently shutting down between subscriber
connections.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
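
The "discard and keep forwarding" pattern, rendered with std's channel for illustration: when the receiving side is gone, send() returns Err, and discarding it with `let _ =` keeps the loop alive instead of breaking out. (The real code uses a tokio broadcast channel, where Err simply means zero active subscribers right now.)

```rust
use std::sync::mpsc::{channel, Sender};

// Forward every frame, ignoring send failures; returns how many loop
// iterations completed. Names are illustrative, not the forwarder's API.
fn forward_frames(tx: &Sender<u32>, frames: &[u32]) -> usize {
    let mut forwarded = 0;
    for &frame in frames {
        let _ = tx.send(frame); // discard the error — do not break the loop
        forwarded += 1;
    }
    forwarded
}

fn main() {
    let (tx, rx) = channel();
    drop(rx); // no subscribers at the moment
    // The forwarder survives every send even with the receiver gone.
    assert_eq!(forward_frames(&tx, &[1, 2, 3]), 3);
}
```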

* fix(transport): address review round 3 — catalog republish, single-lock route_packet, cleanup

- Re-publish MoQ catalog when dynamic tracks are added/removed (push.rs)
- Merge route_packet double RwLock acquisition into single lock with RouteOutcome enum
- Add design rationale comment on std::sync::RwLock choice for DynamicOutputs
- Extract moq_accepted_media_types() helper, deduplicate across peer/mod.rs and push.rs
- Change dynamic pin validation log from warn to debug (dynamic_actor.rs)
- Use Arc::default() consistently for DynamicOutputs construction
- Update moq_peer.yml comment to mention video/data output pin
- Remove unused type imports from push.rs

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(transport): downgrade catalog republish log to debug

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: address round 4 review — packet drops, forwarder lifecycle, type validation

- route_packet: match on TrySendError::Full/Closed via RouteOutcome enum,
  log dropped packets at debug level instead of silently discarding
- Store JoinHandle for each dynamic input forwarder in a HashMap;
  abort on RemoveInputPin to prevent task leaks
- After dynamic output pin creation in connect_nodes, validate type
  compatibility using can_connect_any before wiring
- republish_catalog returns bool; on failure roll back catalog entry
  and skip adding DynamicInputState
- Use swap_remove instead of remove for O(1) dynamic_inputs removal
- Consistent lock-poisoning recovery via unwrap_or_else(PoisonError::into_inner)
- Align default dynamic pin names (in_dyn → dynamic_in)
- Extract activate_dynamic_input, insert/remove_catalog_rendition helpers
  to stay within cognitive_complexity limit

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
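
The lock-poisoning recovery idiom adopted above is worth a standalone look: rather than unwrap() (which propagates the panic to every later caller), take the guard back out of the PoisonError. A minimal sketch:

```rust
use std::sync::{Arc, Mutex, PoisonError};
use std::thread;

// Read a value through a possibly-poisoned lock, recovering the guard.
fn read_counter(lock: &Mutex<u32>) -> u32 {
    *lock.lock().unwrap_or_else(PoisonError::into_inner)
}

fn main() {
    let data = Arc::new(Mutex::new(7u32));
    let poisoner = Arc::clone(&data);
    // A thread that panics while holding the lock poisons it.
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("poison the lock");
    })
    .join();
    assert!(data.is_poisoned());
    // Recovery still reads the (consistent) inner value.
    assert_eq!(read_counter(&data), 7);
}
```

Recovering is sound here because the protected maps are only mutated through whole-entry inserts/removes, so a panicked holder cannot leave them half-written.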

* fix: clean up stale resources on dynamic pin creation failures

- Type-mismatch early return in connect_nodes now removes the orphaned
  PinDistributor entry and stale pin metadata before returning
- AddedOutputPin send failure path gets the same cleanup
- Document that validate_connection_types skips dest-pin validation too
  when source node supports dynamic pins (known limitation)
- RemoveInputPin in push.rs uses swap_remove instead of drain+collect
- Prune finished forwarder JoinHandles on AddedInputPin to prevent
  unbounded growth from naturally-closed channels
- Add safety comment about poll_fn/select! mutable borrow interaction
- Deduplicate output_pins() by reusing make_dynamic_output_pin

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: finish leaked producers, shut down orphaned distributors, cleanup nits

- activate_dynamic_input: finish track producer before returning on
  catalog republish failure to avoid dangling broadcast track
- connect_nodes: send PinConfigMsg::Shutdown to the spawned
  PinDistributor on both type-mismatch and AddedOutputPin send failure
  error paths, preventing orphaned actor tasks
- Abort all forwarder JoinHandles on node shutdown for deterministic
  cleanup instead of relying on channel close propagation
- Remove redundant 'let mut catalog_producer = catalog_producer' rebinding
- Downgrade subscriber_count atomics from SeqCst to Relaxed (only used
  for logging, no cross-variable synchronization needed)

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix: rollback leaked input pins, guard duplicates, timeout pin creation

- Add rollback_dynamic_input helper to clean up destination input pins
  when step-2 (output pin creation) fails in connect_nodes
- Track created_dynamic_input to conditionally rollback on all 6 step-2
  failure paths (type mismatch, send failures, timeouts)
- Wrap RequestAddInputPin and RequestAddOutputPin responses with
  tokio::time::timeout(5s) to prevent engine deadlock
- Guard duplicate dynamic input pin names in push.rs with
  check-and-replace via swap_remove
- Abort old forwarder handle on re-add collision in peer/mod.rs
- Extract activate_dynamic_input_forwarder to reduce cognitive complexity
- Bump stale dynamic output entry log from debug to info
- Make original catalog binding mut, remove redundant rebind
- Align moq_accepted_media_types() import qualification
- Shut down orphaned PinDistributor actors on type-mismatch and
  AddedOutputPin send failure paths

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
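
The timeout guard above uses tokio::time::timeout(5s) around the pin-response await; here is a std-library analogue of the same "never block forever on a reply" rule using recv_timeout. The function name and the short test timeout are illustrative:

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::Duration;

// Wait for a pin-management reply, but never longer than `wait`.
fn await_reply(rx: &Receiver<&'static str>, wait: Duration) -> Result<&'static str, RecvTimeoutError> {
    rx.recv_timeout(wait)
}

fn main() {
    let (_tx, rx) = channel();
    // No AddedOutputPin reply ever arrives; the guard fires instead of
    // hanging, letting the caller roll back the partially created pins.
    assert_eq!(
        await_reply(&rx, Duration::from_millis(50)),
        Err(RecvTimeoutError::Timeout)
    );
}
```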

* fix: assert pin.name == from_pin invariant on dynamic output creation

Add debug_assert_eq! after receiving the pin definition from
RequestAddOutputPin to make the implicit contract explicit: the
node must return the suggested name unchanged.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
staging-devin-ai-integration bot pushed a commit that referenced this pull request Apr 14, 2026
…r nits

Finding #5: Extract generic vaapi_decode_loop_body<D>() and
vaapi_drain_decoder_events<D>() in vaapi_av1.rs, parameterised on
StatelessVideoDecoder codec type.  Both vaapi_h264_decode_loop and
vaapi_av1_decode_loop now delegate to these shared helpers, removing
~130 lines of near-identical code.  The AV1 decode loop init is
simplified to use the existing open_va_and_gbm() helper.

Finding #8: Add comment block explaining why the Vulkan Video H.264
encoder does not use StandardVideoEncoder / spawn_standard_encode_task
(no flush(), eager device pre-init, different dimension-change model).

Finding #9: Remove redundant init_vulkan_encode_device() call inside
the dimension-change block — the Vulkan device is pre-initialised and
never cleared, so we use it directly instead of cloning through the
init helper.  Also removes the now-unnecessary device re-assignment.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
streamer45 added a commit that referenced this pull request Apr 14, 2026
…AV1, NVENC/NVDEC AV1) (#279)

* feat(nodes): add HW video codec backends (Vulkan Video H.264, VA-API AV1, NVENC/NVDEC AV1)

Implement hardware-accelerated video encoding and decoding for StreamKit,
targeting Linux with Intel and NVIDIA GPUs (issue #217).

Three backends behind optional feature flags:

  vulkan_video — H.264 encode/decode via Vulkan Video (vk-video v0.3).
    Cross-vendor (Intel ANV, NVIDIA, AMD RADV). Includes lazy encoder
    creation on first frame for resolution detection, NV12/I420 input
    support, and configurable bitrate/framerate/keyframe interval.

  vaapi — AV1 encode/decode via VA-API (cros-codecs v0.0.6).
    Primarily Intel (intel-media-driver), also AMD. Uses GBM surfaces
    for zero-copy VA-API buffer management. Includes stride-aware
    NV12 plane read/write helpers with odd-width correctness.

  nvcodec — AV1 encode/decode via NVENC/NVDEC (shiguredo_nvcodec v2025.2).
    NVIDIA only (RTX 30xx+ decode, RTX 40xx+ AV1 encode). Dynamic CUDA
    loading — no build-time CUDA Toolkit required for the host binary.

All backends share:
- HwAccelMode enum (auto/force_hw/force_cpu) for graceful fallback
- ProcessorNode trait integration with health reporting
- Consistent config structs with serde deny_unknown_fields validation
- Comprehensive unit tests (mock-based, no GPU required)

Closes #217

Signed-off-by: Devin AI <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci: run nvcodec tests on GPU runner

The self-hosted GPU runner (skit-demo-eu-gpu) has an NVIDIA GPU but the
CI workflow wasn't exercising the nvcodec feature tests. Add the missing
cargo test invocation so NVENC/NVDEC AV1 tests run alongside the
existing GPU compositor tests.

Signed-off-by: Devin AI <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci: install CUDA headers on GPU runner for nvcodec tests

The shiguredo_nvcodec build script requires cuda.h at compile time.
Install nvidia-cuda-toolkit on the self-hosted GPU runner if CUDA
headers aren't already present.

Signed-off-by: Devin AI <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci: set CUDA_INCLUDE_PATH for nvcodec build on GPU runner

Ubuntu's nvidia-cuda-toolkit installs cuda.h to /usr/include, but
shiguredo_nvcodec's build script defaults to /usr/local/cuda/include.
Set CUDA_INCLUDE_PATH=/usr/include so the build finds the headers.

Signed-off-by: Devin AI <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci: fix nvcodec build on GPU runner (BINDGEN_EXTRA_CLANG_ARGS)

Remove conditional nvidia-cuda-toolkit install (already pre-installed
on the self-hosted runner) and add BINDGEN_EXTRA_CLANG_ARGS to point
bindgen at the LLVM 18 clang builtin includes so stddef.h is found.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci: reorder GPU tests so nvcodec runs before engine

The streamkit-engine GPU test binary segfaults (SIGSEGV) during
cleanup after all 25 tests pass — this is a pre-existing issue
likely related to wgpu/Vulkan teardown.  Move the nvcodec node
tests before the engine GPU tests so they are not blocked by
the crash.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): add missing framerate field in nvcodec test

The force_cpu_encoder_rejected test was constructing
NvAv1EncoderConfig with all fields explicitly but missed the
new framerate field added in the review-fix round.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): register HW codec nodes, fix i420_to_nv12 truncation, remove dead code

- Add cfg-gated registration calls for vulkan_video, vaapi, and nvcodec
  nodes in register_video_nodes() — without these, users enabling the
  features would get 'node not found' errors at runtime.
- Fix i420_to_nv12 in vulkan_video.rs to use div_ceil(2) for chroma
  dimensions instead of truncating integer division (h/2, w/2), matching
  the correct implementation in nv_av1.rs.
- Update HwAccelMode::Auto doc comment to accurately reflect that
  HW-only nodes do not implement CPU fallback — Auto and ForceHw
  behave identically; CPU fallback is achieved by selecting a different
  (software) node at the pipeline level.
- Remove dead default_quality() and default_framerate() functions in
  vaapi_av1.rs (unused — the struct uses a manual Default impl).
- Add registration regression tests to nv_av1 and vaapi_av1 modules.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
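
Why div_ceil matters for the chroma planes: 4:2:0 subsampling stores one chroma sample per 2x2 luma block, so odd luma dimensions must round up, not truncate. A minimal sketch:

```rust
// Chroma plane dimensions for 4:2:0 content; div_ceil rounds odd sizes up.
fn chroma_dims(w: u32, h: u32) -> (u32, u32) {
    (w.div_ceil(2), h.div_ceil(2))
}

fn main() {
    // Even dimensions: truncating and ceiling division agree.
    assert_eq!(chroma_dims(640, 480), (320, 240));
    // Odd 641x481: truncating division (w / 2, h / 2) would yield (320, 240)
    // and drop the last chroma column and row; ceiling division keeps them.
    assert_eq!(chroma_dims(641, 481), (321, 241));
}
```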

* fix(nodes): add encoder flush comment, validate cuda_device, use GBM plane offsets

- vulkan_video.rs: document that vk-video 0.3.0 BytesEncoder has no
  flush() method (unlike BytesDecoder); frame-at-a-time, no B-frames
- nv_av1.rs: reject cuda_device > i32::MAX at construction time
  instead of silently wrapping via 'as i32' cast
- vaapi_av1.rs: use gbm_frame.get_plane_offset() for FrameLayout
  instead of manually computing y_stride * coded_height; also fix
  stride fallback to use coded_width instead of display width

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(skit): forward HW codec feature flags from streamkit-server to streamkit-nodes

Without these forwarding features, `just extra_features="--features vulkan_video" skit`
would silently ignore the feature since streamkit-server didn't know about it.

Adds vulkan_video, vaapi, and nvcodec feature forwarding, matching the
existing pattern for svt_av1 and dav1d.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* docs(samples): add HW video codec sample pipelines

Add oneshot and dynamic (MoQ) sample pipelines for each HW video codec
backend:

- Vulkan Video H.264: video_vulkan_video_h264_colorbars (oneshot + MoQ)
- VA-API AV1: video_vaapi_av1_colorbars (oneshot + MoQ)
- NVENC AV1: video_nv_av1_colorbars (oneshot + MoQ)

Each oneshot pipeline generates SMPTE color bars, HW-encodes, muxes into
a container (MP4 for H.264, WebM for AV1), and outputs via HTTP.

Each dynamic pipeline generates color bars, HW-encodes, and streams via
MoQ for live playback in the browser.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): revert get_plane_offset to computed fallback

get_plane_offset() is private in cros-codecs 0.0.6. Fall back to
computing the UV plane offset from pitch × coded_height, which is
correct for linear NV12 allocations used by VA-API encode surfaces.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: format vaapi_av1.rs

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(nodes): add VA-API H.264 encoder and decoder nodes

Add vaapi_h264 module with VaapiH264EncoderNode and VaapiH264DecoderNode
using cros-codecs StatelessEncoder/StatelessDecoder for H.264 via VA-API.

- Encoder: CQP rate control, Main profile, macroblock-aligned coding
- Decoder: stateless H.264 decode with format-change handling
- Reuses shared helpers from vaapi_av1 (GBM/NV12 I/O, device detection)
- Registration: video::vaapi::h264_encoder, video::vaapi::h264_decoder
- Sample pipelines: oneshot MP4 + dynamic MoQ for VA-API H.264

Supported on Intel (Sandy Bridge+), AMD, and NVIDIA (decode only).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(nodes): add VA-API H.264 encoder and decoder nodes

Add vaapi_h264 module with VaapiH264EncoderNode and VaapiH264DecoderNode
using cros-codecs StatelessEncoder/StatelessDecoder for H.264 via VA-API.

- Encoder: CQP rate control, Main profile, macroblock-aligned coding
- Decoder: stateless H.264 decode with format-change handling
- Reuses shared helpers from vaapi_av1 (GBM/NV12 I/O, device detection)
- Registration: video::vaapi::h264_encoder, video::vaapi::h264_decoder
- Sample pipelines: oneshot MP4 + dynamic MoQ for VA-API H.264

Supported on Intel (Sandy Bridge+), AMD, and NVIDIA (decode only).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): auto-detect VA-API H.264 encoder entrypoint

Modern Intel GPUs (Gen 9+ / Skylake onwards) only expose the low-power
fixed-function encoder (VAEntrypointEncSliceLP), not the full encoder
(VAEntrypointEncSlice).  Query the driver for supported entrypoints and
auto-select the correct one instead of hardcoding low_power=false.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): bypass GBM for VA-API encoders, use direct VA surfaces

Replace GBM-backed frame allocation with direct VA surface creation
and Image API uploads for both H.264 and AV1 VA-API encoders.

The cros-codecs GBM allocator uses GBM_BO_USE_HW_VIDEO_ENCODER, a flag
that Mesa's iris driver does not support for NV12 on some hardware
(e.g. Intel Tiger Lake with Mesa 23.x), causing 'Error allocating
contiguous buffer' failures.

By using libva Surface<()> handles instead:
- Surfaces are created via vaCreateSurfaces (no GBM needed)
- NV12 data is uploaded via the VA Image API (vaCreateImage + vaPutImage)
- The encoder's import_picture passthrough accepts Surface<()> directly
- Pitches/offsets come from the VA driver's VAImage, not GBM

This also adds two new shared helpers in vaapi_av1.rs:
- open_va_display(): opens VA display without GBM device
- write_nv12_to_va_surface(): uploads NV12/I420 frame data to a VA
  surface using the Image API, returning driver pitches/offsets

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): use ceiling division for chroma dimensions in VA surface upload

write_nv12_to_va_surface used truncating integer division (w / 2, h / 2)
for chroma plane dimensions, which would corrupt chroma data for frames
with odd width or height.  VideoLayout::packed uses (width + 1) / 2 for
chroma dimensions, so the upload function must match.

Changes:
- NV12 path: use (h+1)/2 for uv_h, ((w+1)/2)*2 for chroma row bytes
- I420 path: use (w+1)/2 for uv_w, (h+1)/2 for uv_h

This matches the existing write_nv12_to_mapping (which uses div_ceil)
and i420_to_nv12_buffer in nv_av1.rs.
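
As a sketch of the rounding rule (helper name is illustrative, not the actual node code):

```rust
// Illustrative helper: chroma plane dimensions for NV12 must round up for
// odd-sized frames, matching VideoLayout::packed's (w + 1) / 2 convention.
fn nv12_chroma_dims(width: usize, height: usize) -> (usize, usize) {
    let uv_rows = height.div_ceil(2);          // (h + 1) / 2 UV rows
    let uv_row_bytes = width.div_ceil(2) * 2;  // interleaved U+V bytes per row
    (uv_row_bytes, uv_rows)
}
```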

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): remove incorrect .min(w) clamp on NV12 UV row copy

For odd-width frames, chroma_row_bytes (e.g. 642 for w=641) is the
correct number of bytes per UV row in VideoLayout::packed format.
Clamping to .min(w) would drop the last V sample on every UV row.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style(nodes): fix rustfmt for VA surface UV copy

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(e2e): add headless pipeline validation tests (#285)

* chore(registry): publish marketplace registry update (#283)

* chore(registry): publish marketplace registry update

* fix(marketplace): prevent plugin releases from becoming latest on GitHub

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* feat(e2e): add headless pipeline validation tests

Add a Rust-based test framework for validating oneshot pipelines against
a live skit server using ffprobe for output verification. No browser
required.

Architecture:
- datatest-stable discovers .yml files in samples/pipelines/test/
- Each .yml has a companion .toml sidecar with expected output metadata
- Tests POST the pipeline YAML to /api/v1/process, save the response,
  and validate codec, resolution, container format via ffprobe
- HW codec tests (NVENC AV1, Vulkan Video H.264) are skipped gracefully
  when the required node kind is not registered on the server

New files:
- tests/pipeline-validation/          Standalone Rust test crate
- samples/pipelines/test/*.yml        4 short test pipelines (30 frames)
- samples/pipelines/test/*.toml       Expected output metadata sidecars
- justfile: test-pipelines recipe

Usage: just test-pipelines http://localhost:4545
  just test-pipelines http://localhost:4545 vp9   # filter by name
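
For illustration, a sidecar could look like this (these field names are assumptions, not the crate's actual schema):

```toml
# Hypothetical expected.toml for a VP9 colorbars pipeline.
video_codec = "vp9"   # ffprobe codec_name
width = 1280          # expected resolution
height = 720
container = "webm"    # ffprobe format_name
```
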
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(e2e): restructure test pipelines to one-dir-per-test layout

Move from flat files to directory-based test layout:

  samples/pipelines/test/<name>/pipeline.yml
  samples/pipelines/test/<name>/expected.toml

Each test is self-contained in its own directory, making it easier to
add test-specific input media or extra config in the future. The
datatest-stable harness now matches on 'pipeline.yml' recursively.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(e2e): add SVT-AV1 test pipeline and CI integration

- Add svt_av1_colorbars test pipeline (SW codec, requires svt_av1 feature)
- Add pacer node to VP9 pipeline for consistency with other WebM pipelines
- Add pipeline-validation job to e2e.yml CI workflow — runs SW codec tests
  (VP9, OpenH264, SVT-AV1) against a live skit server with ffprobe validation
- GPU-specific tests (NVENC AV1, Vulkan Video H.264) are skipped in CI
  via the requires_node mechanism

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ci): fail explicitly when skit server doesn't start

Add HEALTHY flag to health check loop so the pipeline-validation CI job
fails with a clear error instead of proceeding to run tests against a
server that never became healthy.
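
A minimal sketch of the flag-based loop (the probe command and retry counts are placeholders, not the actual workflow step):

```shell
# Illustrative health-wait helper; probe_cmd stands in for the real
# curl health check, and the retry count is a placeholder.
wait_healthy() {
  probe_cmd="$1"
  attempts="${2:-30}"
  HEALTHY=0
  for _ in $(seq 1 "$attempts"); do
    if eval "$probe_cmd"; then
      HEALTHY=1
      break
    fi
    sleep 0.1
  done
  if [ "$HEALTHY" -ne 1 ]; then
    echo "server never became healthy" >&2
    return 1
  fi
}
```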

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(e2e): add complete format coverage for pipeline validation tests

Extend the pipeline validation framework to support audio-only tests and
file upload pipelines, then add 8 new test cases covering all core
codecs, muxers, and demuxers:

Audio codec tests:
- opus_roundtrip: Opus encode/decode via Ogg container
- opus_mp4: Opus encode via MP4 container (file mode)
- flac_decode: FLAC decoder (symphonia) → Opus/Ogg
- mp3_decode: MP3 decoder (symphonia) → Opus/Ogg
- wav_decode: WAV demuxer (symphonia) → Opus/Ogg

Video codec/decoder tests:
- rav1e_colorbars: rav1e AV1 encoder → WebM
- vp9_roundtrip: VP9 encode → decode → re-encode roundtrip
- dav1d_roundtrip: SVT-AV1 → dav1d decode → SVT-AV1 re-encode

Framework changes:
- Expected struct now supports audio-only tests (audio_codec,
  sample_rate, channels) and file uploads (input_file)
- run_pipeline() accepts optional input file for multipart upload
- validate_output() validates audio and/or video stream properties
- Test audio fixtures (Ogg/Opus, FLAC, MP3, WAV) in fixtures/

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* chore: add REUSE/SPDX license files for test audio fixtures

Adds CC0-1.0 license companion files for the generated test tone
audio fixtures (ogg, flac, mp3, wav) to satisfy the reuse-compliance
check.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(ci): add GPU pipeline validation job on self-hosted runner

Adds a 'Pipeline Validation (GPU)' job to the E2E workflow that runs
on the self-hosted GPU runner. This builds skit with gpu, svt_av1, and
dav1d_static features, starts the server, and runs all pipeline
validation tests.

Currently the NVENC AV1 and Vulkan Video H.264 tests will skip
gracefully since those features (nvcodec, vulkan_video) aren't on main
yet. Once PR #279 merges, adding those features to the build command
will enable full HW codec pipeline validation in CI.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ci): use alternate port for GPU pipeline validation server

The self-hosted GPU runner has a persistent skit instance on port
4545. Use port 4546 for the pipeline validation server to avoid
'Address already in use' errors.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ci): add libssl-dev for GPU pipeline validation runner

The pipeline-validation test crate depends on reqwest which pulls in
openssl-sys. The self-hosted GPU runner needs libssl-dev installed.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(test): correct multipart field name and audio channel config

- Use 'media' instead of 'file' for the multipart field name to match
  the server's http_input binding convention.
- Set channels: 2 on all opus_encoder nodes since test fixtures are
  stereo, fixing 'Incompatible connection' errors on the GPU runner.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(test): use mono audio fixtures to match Opus encoder pin type

The Opus encoder node's input pin is hardcoded to accept mono audio
(channels: 1). Regenerate all test fixtures as mono sine waves and
update pipeline configs and expected.toml files accordingly.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ci): add cleanup step to kill skit on self-hosted runner

Self-hosted runners persist between runs, so background processes can
accumulate. Add an always-run cleanup step to kill the skit process
after tests complete.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(test): remove incompatible audio decode pipelines and fix GPU cleanup

Remove flac_decode, mp3_decode, and wav_decode test pipelines: the FLAC
decoder, MP3 decoder, and WAV demuxer all declare channels: 2 in their
static output pins, but the Opus encoder (the only audio encoder) only
accepts channels: 1. This static type mismatch causes pipeline validation
to reject the connection before any audio data flows.

Audio codec coverage is retained via opus_roundtrip (Ogg) and opus_mp4
(MP4) tests which exercise the full Opus encode/decode path.

Also fix GPU CI cleanup: use PID-based kill instead of pkill pattern
matching (the port number is in an env var, not the command line).

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(ci): enable nvcodec + vulkan_video in GPU pipeline validation

Now that the branch is rebased on top of PR #279 (HW video codecs),
enable the nvcodec and vulkan_video features in the GPU CI build so
the nv_av1_colorbars and vulkan_video_h264_colorbars tests actually
run on the self-hosted GPU runner.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(test): correct doc comment for multipart field name

The doc comment said 'file' but the code uses 'media'.

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Signed-off-by: Devin AI <devin@streamkit.dev>
Co-authored-by: streamkit-bot <registry-bot@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>

* fix(e2e): improve pipeline test diagnostics and GPU CI reliability

- Add file size to ffprobe error messages for easier debugging
- Detect empty response bodies (encoder failed to produce output)
- Capture skit server logs in GPU CI job for post-mortem analysis
- Use --test-threads=1 for GPU tests to avoid NVENC session contention

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): eagerly init Vulkan device to prevent empty output on fast pipelines

The Vulkan Video H.264 encoder lazily initialised the VulkanDevice
inside the blocking encode task on the first frame.  On GPUs where
device creation takes ~500 ms (common on CI runners), short pipelines
such as colorbars (30 frames in ~12 ms) would close the input stream
before the encoder was ready, resulting in zero encoded packets and an
empty HTTP response.

Move device initialisation to a dedicated spawn_blocking call that
completes before the encode loop starts.  The BytesEncoder is still
created lazily on the first frame (to know the resolution), but the
expensive Vulkan instance/adapter/device setup is already done.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): add panic detection and lifecycle tracing to codec forward loop

- Log which select branch fires in codec_forward_loop (drain path)
- Detect and log panics from codec tasks instead of silently swallowing
- Track frames_encoded count in Vulkan encoder task
- Increase GPU CI server log capture from 100 to 500 lines
- Enable debug logging for codec/vulkan modules in GPU CI

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): force first Vulkan Video H.264 frame as IDR keyframe

The MP4 muxer gates all video packets until it sees the first keyframe.
The colorbars source does not set metadata.keyframe, so force_keyframe
defaulted to false for every frame.  Without an explicit IDR request
the Vulkan Video encoder may not mark the first frame as a keyframe,
causing the muxer to skip all 30 packets and produce an empty output.

Also fix clippy lint: collapse identical Ok(())/Err(_) match arms in
codec_forward_loop's codec-task await.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): add diagnostic tracing for pipeline shutdown race condition

Add targeted tracing to identify why BytesOutputNode exits before
receiving data from the MP4 muxer:

- recv_with_cancellation: distinguish cancellation-token vs channel-close
- graph_builder: log when each node task completes (success or error)
- mp4 muxer: log keyframe gate decisions (first keyframe seen vs skip)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): await codec task before draining to prevent shutdown race

The codec_forward_loop drain phase previously interleaved with the
(potentially slow) blocking encode task.  On fast pipelines the drain
could take 100+ ms while downstream nodes (MP4 muxer, BytesOutputNode)
processed and closed their channels, resulting in zero-byte output.

Restructure the drain so that we:
1. Break out of the select loop when the input task completes.
2. Await the codec (blocking) task to completion — all results are now
   buffered in result_rx.
3. Drain the fully-buffered results in a tight loop, forwarding them
   downstream before any channel can close.

This eliminates the race window between result forwarding and downstream
shutdown.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: format codec_utils.rs

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): resolve clippy cognitive_complexity in codec_utils and mp4

- codec_utils: extract finish_codec_task helper to reduce nesting
- mp4: flatten keyframe gate logic to remove nested if/else

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): extract accumulate_video_sample to fix mp4 cognitive_complexity

Extract video frame accumulation logic (Annex B → AVCC conversion,
sample entry tracking, duration calculation) into a standalone helper
to bring run_stream_mode under the cognitive_complexity threshold.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): extract accumulate_audio_sample to further reduce mp4 complexity

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): abort codec task before awaiting in non-drain path

Fixes a potential deadlock: when the output channel closes (e.g. client
disconnect), the select loop breaks with drain_pending=false.  Without
aborting the codec task first, it may be blocked on blocking_send() with
a full result channel that nobody is draining, causing finish_codec_task
to wait forever.

Identified by Devin Review.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): extract check_video_keyframe_gate to further reduce mp4 complexity

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): populate avcC chroma fields for High profile H.264

The shiguredo_mp4 library requires chroma_format, bit_depth_luma_minus8,
and bit_depth_chroma_minus8 fields in the AvccBox for profiles other
than Baseline (66), Main (77), and Extended (88).  HW encoders like
Vulkan Video typically produce High profile (100) H.264, causing
'Missing chroma_format field in avcC box' when the MP4 muxer tries to
create the init segment.

Set 4:2:0 chroma (1) and 8-bit depth (0) for non-Baseline/Main/Extended
profiles, matching the NV12 format used by all HW encoder backends.
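
In effect (profile_idc values per the H.264 spec; function name is illustrative):

```rust
// Illustrative mapping: profiles other than Baseline (66), Main (77), and
// Extended (88) carry explicit chroma/bit-depth fields in the avcC box.
// Returns (chroma_format, bit_depth_luma_minus8, bit_depth_chroma_minus8).
fn avcc_chroma_fields(profile_idc: u8) -> Option<(u8, u8, u8)> {
    match profile_idc {
        66 | 77 | 88 => None, // fields not present for these profiles
        _ => Some((1, 0, 0)), // 4:2:0 chroma, 8-bit luma/chroma (NV12)
    }
}
```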

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): drain results concurrently with codec task to prevent deadlock

The previous fix (awaiting the codec task before draining) introduced a
deadlock when the codec produces more results than the bounded channel
capacity (32).  The codec task blocks on blocking_send() waiting for
space, but nobody is draining result_rx because we're waiting for the
codec task to finish first.

Fix by using tokio::select! with biased polling: drain results from
result_rx (keeping the channel flowing) while simultaneously awaiting
the codec task.  Once the codec task finishes, result_tx is dropped
and result_rx.recv() returns None, ending the drain loop naturally
with all results forwarded.

This fixes opus_mp4 and opus_roundtrip pipeline validation tests that
were hanging because the OpusEncoder produces ~51 frames (exceeding
the 32-capacity channel).
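
The shape of the fix, sketched with std threads standing in for the tokio tasks (names are illustrative):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Std-threads analogue of the concurrent drain: the producer ("codec task")
// blocks on a bounded channel, so the consumer must keep draining while the
// producer runs. Dropping the sender ends the drain loop naturally.
fn drain_all(frames: usize, capacity: usize) -> Vec<usize> {
    let (tx, rx) = sync_channel(capacity);
    let codec = thread::spawn(move || {
        for i in 0..frames {
            tx.send(i).unwrap(); // blocks when the channel is full
        }
        // tx dropped here: rx.recv() returns Err, ending the drain.
    });
    let mut out = Vec::new();
    while let Ok(v) = rx.recv() {
        out.push(v); // drain concurrently, keeping the channel flowing
    }
    codec.join().unwrap();
    out
}
```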

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): extract drain_codec_results to reduce cognitive complexity

Extract the concurrent drain loop into a separate
drain_codec_results() function to bring codec_forward_loop back under
the clippy cognitive_complexity limit (50).

Also adds Send + Sync bounds to the to_packet closure parameter to
support the extracted async function.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): detect keyframe from VA-API encoder bitstream output

The VA-API AV1 and H.264 encoders were cloning input metadata without
updating the keyframe flag based on actual encoder output.  This caused
downstream consumers (MP4 muxer, RTMP/MoQ transport) to miss keyframes,
particularly encoder-initiated periodic keyframes from the LowDelay
prediction structure.

Add bitstream-level keyframe detection:
- AV1: parse OBU headers to find Frame OBU with frame_type == KEY_FRAME
- H.264: scan Annex B start codes for IDR NAL unit type (5)
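
The H.264 side amounts to a start-code scan along these lines (function name is illustrative, not the actual node code):

```rust
// Illustrative Annex B scan: find 3- or 4-byte start codes and check
// whether any NAL header has nal_unit_type == 5 (IDR slice).
fn contains_idr_nal(bitstream: &[u8]) -> bool {
    let mut i = 0;
    while i + 3 < bitstream.len() {
        if bitstream[i] == 0 && bitstream[i + 1] == 0 {
            // Distinguish 00 00 01 from 00 00 00 01 start codes.
            let off = if bitstream[i + 2] == 1 {
                3
            } else if bitstream[i + 2] == 0 && i + 4 < bitstream.len() && bitstream[i + 3] == 1 {
                4
            } else {
                i += 1;
                continue;
            };
            if bitstream[i + off] & 0x1f == 5 {
                return true; // IDR NAL found
            }
            i += off;
        } else {
            i += 1;
        }
    }
    false
}
```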

Both encode() and flush_encoder() paths now set the keyframe flag from
the actual encoded bitstream rather than blindly cloning input metadata.

Also fix HwAccelMode serde rename_all from "lowercase" to "snake_case"
so ForceHw serializes as "force_hw" (not "forcehw").

Include unit tests for all keyframe detection functions.

Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): construct VA-API encoder backend directly to satisfy trait bounds

The `CrosVaapiAv1Encoder` and `CrosVaapiH264Encoder` type aliases use
`Surface<()>` to bypass GBM buffer allocation, but `Surface<()>` does
not implement the `VideoFrame` trait required by `new_vaapi()`.

Replace `new_vaapi()` calls with direct `VaapiBackend::new()` +
`new_av1()`/`new_h264()` construction — the same pattern used by
cros-codecs' own tests — which avoids the `V: VideoFrame` constraint
while preserving the GBM-free surface path.

Also removes unused imports (GbmDevice, GbmExternalBufferDescriptor,
ReadMapping, WriteMapping, CrosFourcc, write_nv12_to_mapping) that were
flagged as warnings.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): restore required imports removed in previous commit

Restore GbmDevice, ReadMapping, WriteMapping, and CrosFourcc imports
that are used by decoder and NV12 helper functions in vaapi_av1.rs.
Only GbmExternalBufferDescriptor was truly unused.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): use GbmVideoFrame for VA-API encoders with runtime GBM fallback

Replace Surface<()> type alias with GbmVideoFrame in both VA-API AV1
and H.264 encoders.  This satisfies the VideoFrame trait bound required
by StatelessEncoder::new_vaapi(), fixing the build with --features vaapi.

At construction time, the encoder probes GBM buffer allocation with
GBM_BO_USE_HW_VIDEO_ENCODER.  If the driver does not support that flag
(e.g. Mesa iris on Intel Tiger Lake with Mesa 23.x), it falls back to
GBM_BO_USE_HW_VIDEO_DECODER which is universally supported and still
produces a valid NV12 buffer the encoder can read.

Also removes the now-unused open_va_display() and
write_nv12_to_va_surface() helper functions, and the direct
VaapiEncBackend import that was only needed for the old manual backend
construction path.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): return error on NV12 bounds-check failure

In write_nv12_to_mapping, the row-copy and I420 UV interleave paths
silently skipped rows when bounds checks failed instead of surfacing
the error. This made it impossible to diagnose corrupted frames from
mismatched buffer sizes.

Change all silent skip patterns to return descriptive error messages
with the exact indices and buffer lengths involved.

Closes #291

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* test(ci): add VA-API test coverage to GPU runner

- Add libva-dev to the test-gpu system dependencies install step.
- Add cargo test with --features vaapi to the GPU test matrix,
  running VA-API AV1 encode/decode tests on the self-hosted runner.
- Add resolution-padding verification test (issue #292) that encodes
  at 1280x720 (coded 1280x768) and asserts decoded frames match the
  original display resolution, not the coded resolution.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(ci): add libgbm-dev to GPU runner dependencies

The cros-codecs VA-API backend links against libgbm for GBM buffer
management. Without libgbm-dev the vaapi feature tests fail to link.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): skip VA-API encode tests on decode-only drivers

NVIDIA's community nvidia-vaapi-driver only supports VA-API decode,
not encode. The existing vaapi_available() check only verifies that a
VA-API display can be opened, which succeeds on NVIDIA — but the
encoder tests then fail because no encode entrypoints exist.

Add vaapi_av1_encode_available() and vaapi_h264_encode_available()
helpers that probe whether the driver actually supports encoding by
attempting to construct the encoder. Encode tests now skip gracefully
on decode-only drivers instead of failing.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(deps): vendor cros-codecs with GbmUsage::Linear support

Vendor cros-codecs 0.0.6 and add a GbmUsage::Linear variant to
GbmDevice::new_frame().  On drivers where neither
GBM_BO_USE_HW_VIDEO_ENCODER nor GBM_BO_USE_HW_VIDEO_DECODER is
supported for contiguous NV12 allocation (e.g. Mesa iris on Intel
Tiger Lake with Mesa ≤ 23.x), the Linear variant falls back to
GBM_BO_USE_LINEAR which is universally supported.

A [patch.crates-io] entry in the workspace Cargo.toml redirects the
cros-codecs dependency to the vendored copy.  This patch should be
removed once upstream cros-codecs ships the Linear variant.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): add GBM_BO_USE_LINEAR fallback for Tiger Lake VA-API

On Intel Tiger Lake with Mesa ≤ 23.x, both GBM_BO_USE_HW_VIDEO_ENCODER
and GBM_BO_USE_HW_VIDEO_DECODER flags are unsupported for contiguous
NV12 buffer allocation, causing VA-API H.264 and AV1 encoding to fail
with 'Error allocating contiguous buffer'.

Add a three-level GBM usage probe to both encoders:
  1. GBM_BO_USE_HW_VIDEO_ENCODER  (optimal tiling)
  2. GBM_BO_USE_HW_VIDEO_DECODER  (decoder-tiled fallback)
  3. GBM_BO_USE_LINEAR            (universal fallback)
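
The probe order, sketched with a stand-in allocator closure (the real GBM allocation call takes more arguments than this):

```rust
// Illustrative fallback probe: try each usage flag in preference order and
// keep the first one the driver accepts.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Usage { Encode, Decode, Linear }

fn probe_usage<T>(
    mut alloc: impl FnMut(Usage) -> Result<T, String>,
) -> Result<(Usage, T), String> {
    let mut last_err = String::new();
    for usage in [Usage::Encode, Usage::Decode, Usage::Linear] {
        match alloc(usage) {
            Ok(buf) => return Ok((usage, buf)),
            Err(e) => last_err = e,
        }
    }
    Err(format!("no supported GBM usage: {last_err}"))
}
```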

Also update the decoder allocation callbacks to try LINEAR when DECODE
fails, ensuring decode also works on affected drivers.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* chore(reuse): add SPDX annotation for vendored cros-codecs

Cover the vendored cros-codecs directory with BSD-3-Clause (ChromiumOS
Authors) in REUSE.toml so the reuse-compliance-check CI job passes.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* chore(reuse): add BSD-3-Clause license text for vendored cros-codecs

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(deps): add GbmUsage::Separated for per-plane R8 VA-API export

On Mesa iris (Tiger Lake), gbm_bo_create rejects the NV12 fourcc with
every usage flag (HW_VIDEO_ENCODER, HW_VIDEO_DECODER, LINEAR).

Add a GbmUsage::Separated variant that bypasses native NV12 allocation
entirely: each plane is allocated as a separate R8 buffer with LINEAR,
then exported to VA-API via a multi-object VADRMPRIMESurfaceDescriptor
(one DMA-BUF FD per plane).

Changes to the vendored cros-codecs:
- GbmUsage::Separated enum variant
- new_frame(): when usage is Separated, take the per-plane R8 path
  even for formats that are normally contiguous (NV12)
- GbmExternalBufferDescriptor: store Vec<File> + object_indices instead
  of a single File, so multi-BO frames can be exported
- to_native_handle(): handle both single-BO and multi-BO frames,
  creating the correct num_objects / object_index mapping

Changes to the encoder/decoder nodes:
- Four-level GBM probe: Encode → Decode → Linear → Separated
- Decoder alloc callbacks: Decode → Linear → Separated fallback

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(deps): use single flat R8 BO for GbmUsage::Separated

The previous multi-BO approach (one R8 BO per plane) failed on Intel
iHD because vaCreateSurfaces rejected the multi-object
VADRMPRIMESurfaceDescriptor for NV12.

Switch to a single oversized R8/LINEAR buffer that is tall enough to
hold all planes end-to-end (height = coded_height × 3/2 for NV12).
The NV12 plane pitches and offsets are computed manually from the R8
stride and stored in a new SeparatedLayout struct on GbmVideoFrame.

This gives us a single DMA-BUF FD → single-object VA-API import, which
is the same proven path that contiguous NV12 allocations use.
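
The layout arithmetic, as an illustrative sketch (names differ from the actual SeparatedLayout fields):

```rust
// Illustrative NV12-in-flat-R8 layout: the Y plane occupies coded_height
// rows, the interleaved UV plane follows at half height, so the R8 buffer
// must be coded_height * 3 / 2 rows tall at the R8 stride.
struct Nv12Layout {
    pitches: [u32; 2],
    offsets: [u32; 2],
    total_height: u32,
}

fn nv12_in_flat_r8(stride: u32, coded_height: u32) -> Nv12Layout {
    Nv12Layout {
        pitches: [stride, stride],
        offsets: [0, stride * coded_height], // UV starts right after Y
        total_height: coded_height * 3 / 2,
    }
}
```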

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): pass display resolution to VA-API encoders (fixes #292)

The AV1 encoder was passing only the superblock-aligned coded resolution
to cros-codecs, which set render_width/render_height in the AV1 frame
header to the coded dimensions.  For non-aligned inputs (e.g. 1280×720
→ coded 1280×768), decoders would show 48 pixels of black padding at
the bottom.

Add a display_resolution field to the vendored cros-codecs AV1
EncoderConfig and use it for render_width/render_height in the frame
header predictor.  When display differs from coded dimensions, the AV1
bitstream now signals render_and_frame_size_different=1 so decoders
crop the superblock padding.

For H.264, the SpsBuilder::resolution() method already handles
macroblock alignment and frame_crop offsets automatically, but we were
passing the pre-aligned coded resolution, bypassing the cropping logic.
Now we pass the original display resolution and let SpsBuilder compute
the correct frame_crop offsets.

Closes #292

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): use VA-API Image API for encoders, drop GBM encoder path

Replace GBM buffer allocation (GbmVideoFrame + GBM_BO_USE_HW_VIDEO_ENCODER)
with direct VA surface creation + Image API upload (vaCreateImage/vaPutImage)
for both AV1 and H264 VA-API encoders.

This bypasses the GBM NV12 allocation that Mesa's iris driver rejects on
Intel Tiger Lake, eliminating the need for the vendored GbmUsage::Linear
and GbmUsage::Separated workarounds.

Changes:
- Add open_va_display() helper (VA-only, no GBM device needed)
- Add write_nv12_to_va_surface() with bounds-check error handling (#291)
- Encoder type aliases use Surface<()> instead of GbmVideoFrame
- Encoder structs drop gbm/gbm_usage fields
- Encoder::encode() creates VA surfaces and uploads via Image API
- Revert vendored gbm_video_frame.rs to upstream (drop Linear/Separated)
- Simplify decoder alloc callbacks to GbmUsage::Decode only
- Update Cargo.toml vendor comment (now only for display_resolution #292)

Decoders remain GBM-backed (GBM_BO_USE_HW_VIDEO_DECODER works on all
tested hardware including Tiger Lake).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): restore CrosFourcc and WriteMapping imports for vaapi tests

These types are used by write_nv12_to_mapping (decoder helper) and
nv12_fourcc(), which are still needed even after switching encoders
to the Image API path. The test module's MockWriteMapping also
implements the WriteMapping trait.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): use direct backend construction for VA-API encoders

Replace new_vaapi() with VaapiBackend::new() + new_av1()/new_h264()
construction. Surface<()> does not implement the VideoFrame trait
required by new_vaapi(), so we construct the backend directly — the
same pattern used by cros-codecs' own tests.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(deps): make new_av1/new_h264 public in vendored cros-codecs

These constructors are needed for direct backend construction (bypassing
new_vaapi() which requires VideoFrame trait bounds that Surface<()>
doesn't satisfy).

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): minimize cros-codecs vendor to pub fn new_h264 only

- AV1 encoder: standard GbmVideoFrame + new_vaapi() path (no vendor changes)
- H264 encoder: Surface<()> + Image API + new_h264() (bypasses GBM on Tiger Lake)
- Revert all vendor changes except one-word visibility: fn new_h264 -> pub fn new_h264
- Remove VaSurface newtype (infeasible due to Send+Sync constraint)
- Remove display_resolution from vendored AV1 EncoderConfig
- Remove pub on new_av1 (not needed, AV1 uses new_vaapi())
- Update Cargo.toml and REUSE.toml comments to reflect minimal patch

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* feat(nodes): replace cros-codecs H.264 encoder with custom VA-API shim

Replace the cros-codecs StatelessEncoder for H.264 encoding with a custom
VA-API shim (vaapi_h264_enc.rs) that drives cros-libva directly.  This
eliminates the need for vendoring cros-codecs entirely.

The custom encoder:
- Uses the VA-API Image API (vaCreateImage/vaPutImage) to upload NV12
  frames, bypassing GBM buffer allocation which Mesa's iris driver
  rejects for NV12 on some hardware (e.g. Intel Tiger Lake with
  Mesa <= 23.x).
- Implements IPP low-delay prediction (periodic IDR + single-reference
  P frames) with CQP rate control.
- Constructs H.264 parameter buffers (SPS/PPS/slice) directly via
  cros-libva's typed wrappers.
- Auto-detects low-power vs full encoding entrypoint.
- Handles non-MB-aligned resolutions via frame cropping offsets.
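The cropping-offset computation can be sketched as follows. For 4:2:0 content with frame_mbs_only_flag = 1, the SPS crop unit is 2 luma samples in each direction (H.264 spec, seq_parameter_set_data semantics), so a non-MB-aligned resolution yields the offsets below. Helper names are illustrative, not the encoder's actual API.

```rust
/// Round a dimension up to the 16-pixel macroblock grid.
fn mb_align(dim: u32) -> u32 {
    (dim + 15) & !15
}

/// SPS frame cropping offsets (right, bottom) for 4:2:0 with
/// frame_mbs_only_flag = 1, where the crop unit is 2 luma samples both
/// horizontally and vertically. Sketch only: the real encoder wraps these
/// in cros-libva's H264EncFrameCropOffsets.
fn crop_offsets(width: u32, height: u32) -> (u32, u32) {
    let right = (mb_align(width) - width) / 2;
    let bottom = (mb_align(height) - height) / 2;
    (right, bottom)
}

fn main() {
    // 1920x1080: height pads to 1088, i.e. 8 padded rows -> crop_bottom = 4.
    assert_eq!(crop_offsets(1920, 1080), (0, 4));
    // MB-aligned resolutions need no cropping.
    assert_eq!(crop_offsets(1280, 720), (0, 0));
    println!("ok");
}
```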

The H.264 decoder and AV1 encoder/decoder continue to use cros-codecs
0.0.6 from crates.io (no vendoring, no patches).

Removes:
- vendor/cros-codecs/ directory (~50k lines, 229 files)
- [patch.crates-io] section from workspace Cargo.toml
- REUSE.toml vendor annotation

Closes #291 (bounds-check errors already fixed in prior commits)
Refs #292 (H.264 resolution padding handled by frame cropping)

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* chore(reuse): remove unused BSD-3-Clause license file

The BSD-3-Clause license was only needed for the vendored cros-codecs
directory, which has been removed in the previous commit.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* style: apply rustfmt to vaapi_h264 encoder shim

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): fix H264EncFrameCropOffsets clone and remove unused imports

H264EncFrameCropOffsets in cros-libva 0.0.12 does not derive Clone.
Reconstruct it from field values instead of cloning.

Remove unused imports: GbmDevice, ReadMapping, CrosVideoFrame.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): restore VideoFrame and ReadMapping trait imports for decoder

These traits must be in scope for the decoder to call get_plane_pitch()
and map() on Arc<GbmVideoFrame>.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): generate SPS/PPS NALUs for custom H.264 VA-API encoder

Some VA-API drivers (notably Intel iHD) do not auto-generate SPS/PPS
NAL units in the coded output.  The cros-libva crate does not expose
packed header buffer types (VAEncPackedHeaderParameterBuffer /
VAEncPackedHeaderDataBuffer), so we cannot request them via the VA-API.

Without SPS/PPS in the bitstream, the fMP4 muxer falls back to
placeholder parameter sets (Baseline profile, 4 bytes) that do not
match the actual Main profile stream — causing browsers to reject the
decoded output and disconnect after the first segment.

Fix: on IDR frames, check whether the coded output already contains
SPS (NAL type 7) and PPS (NAL type 8).  If not, generate conformant
SPS/PPS NALUs from the encoder parameters using a minimal exp-Golomb
bitstream writer, and prepend them to the coded data.

Includes unit tests for the BitWriter (bits, ue, se) and for the
bitstream_contains_sps_pps scanner.
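A minimal exp-Golomb bit writer in the spirit of the one this commit adds might look like the sketch below (names and layout are illustrative; the real BitWriter lives in vaapi_h264_enc.rs and may differ).

```rust
/// MSB-first bit writer with unsigned/signed exp-Golomb helpers.
struct BitWriter {
    bytes: Vec<u8>,
    nbits: usize, // bits already written into the last byte (0..8)
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { bytes: Vec::new(), nbits: 0 }
    }

    /// Append the low `n` bits of `val`, most significant bit first.
    fn bits(&mut self, val: u32, n: usize) {
        for i in (0..n).rev() {
            if self.nbits == 0 {
                self.bytes.push(0);
            }
            let bit = ((val >> i) & 1) as u8;
            let last = self.bytes.last_mut().unwrap();
            *last |= bit << (7 - self.nbits);
            self.nbits = (self.nbits + 1) % 8;
        }
    }

    /// Unsigned exp-Golomb ue(v): leading zeros, then the value v+1.
    fn ue(&mut self, v: u32) {
        let x = v + 1;
        let len = 32 - x.leading_zeros() as usize; // bit length of x
        self.bits(0, len - 1);
        self.bits(x, len);
    }

    /// Signed exp-Golomb se(v): maps 0, 1, -1, 2, -2, ... onto ue codes.
    fn se(&mut self, v: i32) {
        let mapped = if v <= 0 { (-2 * v) as u32 } else { (2 * v - 1) as u32 };
        self.ue(mapped);
    }
}

fn main() {
    // ue(0)="1", ue(1)="010", ue(2)="011"; concatenated + zero padding = 0xA6.
    let mut w = BitWriter::new();
    w.ue(0);
    w.ue(1);
    w.ue(2);
    assert_eq!(w.bytes, vec![0b1010_0110]);
    // se(-1) maps to code number 2, i.e. ue(2) = "011".
    let mut s = BitWriter::new();
    s.se(-1);
    assert_eq!(s.bytes, vec![0b0110_0000]);
    println!("ok");
}
```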

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): address code review findings for VA-API H.264 encoder

- [P1] Pipeline validation: fail when PIPELINE_REQUIRE_NODES=1 is set
  and a required node is missing (prevents false-green CI runs); also
  panic on unreachable schema endpoint under the same flag. Set the
  env var in both CI pipeline-validation jobs.

- [P3] Fix H.264 CPU fallback messages: decoder now says 'no CPU H.264
  decoder is currently available' (none exists); encoder points to
  video::openh264::encoder (the only software H.264 encoder).

- Fix unsafe aliasing in MockWriteMapping test mock (vaapi_av1.rs):
  replaced RefCell round-tripping with raw-pointer storage matching
  upstream GbmMapping pattern, with proper SAFETY comments.

- Deduplicate I420-to-NV12 conversions: extracted shared
  i420_frame_to_nv12_buffer() into video/mod.rs, removed duplicate
  implementations from nv_av1.rs and vulkan_video.rs.

- Remove dead accessors on VaH264Encoder (display(), width(), height())
  — only coded_width()/coded_height() are used.

- Add debug_assert for NV12 packed-layout assumption in
  write_nv12_to_va_surface (stride == width contract).

- Fix endian-dependent fourcc: replace u32::from_ne_bytes(*b"NV12")
  with nv12_fourcc().into() matching vaapi_av1.rs.

- Fix scratch surface pool: return old reference frame surfaces to the
  pool instead of dropping them.

- Add idr_period documentation comment explaining the hardcoded 1024
  default and how callers can force IDR via force_keyframe.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): use crate::video path for i420_frame_to_nv12_buffer in test modules

super:: inside nv_av1::tests and vulkan_video::tests resolves to
the nv_av1/vulkan_video module, not the video parent module where
i420_frame_to_nv12_buffer lives.  Use the fully qualified crate path.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* refactor(nodes): consolidate VA-API decode loop + Vulkan Video encoder nits

Finding #5: Extract generic vaapi_decode_loop_body<D>() and
vaapi_drain_decoder_events<D>() in vaapi_av1.rs, parameterised on
StatelessVideoDecoder codec type.  Both vaapi_h264_decode_loop and
vaapi_av1_decode_loop now delegate to these shared helpers, removing
~130 lines of near-identical code.  The AV1 decode loop init is
simplified to use the existing open_va_and_gbm() helper.
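The shape of that consolidation can be sketched with generics as below. `StatelessVideoDecoder` here is a stand-in trait with a made-up signature, not cros-codecs' real one; it only illustrates how one loop body can serve both codec types.

```rust
/// Stand-in for the codec-parameterised decoder trait (illustrative only).
trait StatelessVideoDecoder {
    fn decode(&mut self, packet: &[u8]) -> Result<usize, String>;
}

struct H264Dec;
struct Av1Dec;

impl StatelessVideoDecoder for H264Dec {
    fn decode(&mut self, packet: &[u8]) -> Result<usize, String> {
        Ok(packet.len()) // toy stand-in: "decode" every byte
    }
}

impl StatelessVideoDecoder for Av1Dec {
    fn decode(&mut self, packet: &[u8]) -> Result<usize, String> {
        Ok(packet.len())
    }
}

/// Shared loop body: both codec-specific loops delegate here instead of
/// duplicating the packet-handling code.
fn decode_loop_body<D: StatelessVideoDecoder>(
    dec: &mut D,
    packets: &[&[u8]],
) -> Result<usize, String> {
    let mut total = 0;
    for p in packets {
        total += dec.decode(p)?;
    }
    Ok(total)
}

fn main() {
    let packets: [&[u8]; 2] = [&[1, 2, 3], &[4]];
    assert_eq!(decode_loop_body(&mut H264Dec, &packets).unwrap(), 4);
    assert_eq!(decode_loop_body(&mut Av1Dec, &packets).unwrap(), 4);
    println!("ok");
}
```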

Finding #8: Add comment block explaining why the Vulkan Video H.264
encoder does not use StandardVideoEncoder / spawn_standard_encode_task
(no flush(), eager device pre-init, different dimension-change model).

Finding #9: Remove redundant init_vulkan_encode_device() call inside
the dimension-change block — the Vulkan device is pre-initialised and
never cleared, so we use it directly instead of cloning through the
init helper.  Also removes the now-unnecessary device re-assignment.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* fix(nodes): restore libva import removed during decode loop consolidation

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

* ci(e2e): drop PIPELINE_REQUIRE_NODES from non-GPU pipeline validation

The non-GPU Pipeline Validation job builds skit without nvcodec or
vulkan_video features, so GPU-only test pipelines (nv_av1_colorbars,
vulkan_video_h264_colorbars) are never available.  With
PIPELINE_REQUIRE_NODES=1 this caused hard failures instead of skips.

The GPU runner (pipeline-validation-gpu) already runs ALL pipeline
tests with PIPELINE_REQUIRE_NODES=1 and all features enabled, so
node registration regressions are still caught.  Without the flag
the non-GPU runner gracefully skips GPU pipelines while still
validating all SW codec pipelines.

Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-Authored-By: Claudio Costa <cstcld91@gmail.com>

---------

Signed-off-by: Devin AI <devin@streamkit.dev>
Signed-off-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: StreamKit Devin <devin@streamkit.dev>
Co-authored-by: Claudio Costa <cstcld91@gmail.com>
Co-authored-by: staging-devin-ai-integration[bot] <166158716+staging-devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: streamkit-bot <registry-bot@streamkit.dev>